On 11.04.2015 03:16, Marlies Hankel wrote:
> Dear all,
>
> Yes, I checked the paths and they looked OK. I also made sure that it finds
> the right MPI version, the VASP path, etc.
>
> I do not think h_vmem is the problem, as I do not get any errors in the
> queue logs, for example. In the end I also changed h_vmem to be not
> consumable and requested a large amount, and that made no difference.
>
> I will try an Open MPI 1.6.5 version and see if that makes any difference.
>
> Would the network scan cause SGE to abort the job?

No. But there is a delay in startup.
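In case the delay itself is a concern: the scan can usually be kept short by
restricting Open MPI to the interfaces that are actually in use. A sketch
only; the interface name eth0 is a placeholder to adjust to the cluster's
setup:

    # limit both the runtime wire-up and the TCP transport to one interface
    mpirun --mca oob_tcp_if_include eth0 \
           --mca btl_tcp_if_include eth0 \
           -np $NSLOTS ./your_mpi_program

Both MCA parameters exist in the 1.6 and the 1.8 series; on an
InfiniBand-only cluster the TCP transport can also be disabled altogether
with "--mca btl ^tcp".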
BTW: Are you using ScaLAPACK for VASP?

-- Reuti

> I do get some message about finding two IBs, but I also get that when I run
> interactively (ssh to the node, not via a qlogin). I have switched that off
> too via an MCA parameter, to make sure this was not causing trouble.
>
> Best wishes
>
> Marlies
>
> On 04/10/2015 08:12 PM, Reuti wrote:
>>> On 10.04.2015 04:51, Marlies Hankel <[email protected]> wrote:
>>>
>>> Dear all,
>>>
>>> I have a ROCKS 6.1.1 install and I have also installed the SGE roll, so
>>> the base configuration was done via the ROCKS install. The only changes I
>>> have made are setting the h_vmem complex to consumable and setting up a
>>> scratch complex. I have also set h_vmem for all hosts.
>>
>> And does the VASP job work without h_vmem? We are using VASP too and have
>> no problems with any h_vmem set.
>>
>>> I can run single-CPU jobs fine and can execute simple things like
>>>
>>>     mpirun -np 40 hostname
>>>
>>> but I cannot run proper MPI programs. I get the following error:
>>>
>>>     mpirun noticed that process rank 0 with PID 27465 on node phi-0-3
>>>     exited on signal 11 (Segmentation fault).
>>
>> Are you using the correct `mpiexec` also during execution of a job, i.e.
>> between the nodes? Maybe the interactive login has a different $PATH set
>> than inside a job script.
>>
>> And if it's from Open MPI: was the application compiled with the same
>> version of Open MPI whose `mpiexec` is later used on all nodes?
>>
>>> Basically, the queue error logs on the head node and the execution nodes
>>> (/opt/gridengine/default/spool/../messages) show nothing, and the .e, .o,
>>> .pe and .po files show nothing either. The above error is in the standard
>>> output file of the program. I am trying VASP but have also tried a
>>> home-grown MPI code. Both have been running out of the box via SGE for
>>> years on our old cluster (which was not ROCKS). I have tried the supplied
>>> orte PE (the programs are compiled with Open MPI 1.8.4
>>
>> The easiest would be to stay with Open MPI 1.6.5 as long as possible. In
>> the 1.8 series they changed some things which might hinder a proper use:
>>
>> - Core binding is enabled by default in Open MPI 1.8. With two MPI jobs on
>>   a node, they may use the same cores and leave others idle. One can use
>>   "--bind-to none" and leave the binding of SGE in effect (if any); a
>>   sketch of the resulting call follows after this reply. The behavior
>>   differs in that SGE gives a job a set of cores and the Linux scheduler
>>   is free to move the processes around inside this set, while the native
>>   binding in Open MPI is per process (something SGE can't do, of course,
>>   as Open MPI forks additional processes after the initial startup of
>>   `orted`). Sure, the set of cores granted by SGE could be rearranged to
>>   pass this list to Open MPI.
>>
>> - Open MPI may scan the network before the actual job starts, to find all
>>   possible routes between the nodes. Depending on the network setup this
>>   may take 1-2 minutes.
>>
>> -- Reuti
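To illustrate the binding point above: a minimal job-script sketch which
disables Open MPI's own binding and leaves any SGE binding in effect (the PE
name "orte" and the binary name "vasp" are placeholders):

    #!/bin/bash
    #$ -pe orte 40
    #$ -cwd -j y
    # let the core set granted by SGE (if any) apply instead of
    # Open MPI's per-process binding
    mpirun --bind-to none -np $NSLOTS ./vasp

With a tightly integrated Open MPI (built --with-sge), `mpirun` also detects
the granted slot count itself, so "-np $NSLOTS" is optional.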
>>> compiled with Intel and with --with-sge and --with-verbs) and have also
>>> tried a PE where I specify "catch rsh" and startmpi/stopmpi scripts, but
>>> it made no difference. It seems as if the program does not even start. I
>>> am not even trying to run over several nodes yet.
>>>
>>> Adding to that, I can run the program (VASP) perfectly fine by ssh-ing to
>>> a node and just running it from the command line, and also over several
>>> nodes via a hostfile. So VASP itself is working fine.
>>>
>>> I had a look at env and made sure the ulimits are set OK (VASP needs
>>> "ulimit -s unlimited" to work), and all looks OK.
>>>
>>> Has anyone seen this problem before? Or do you have any suggestions on
>>> how to get some information on where it actually goes wrong?
>>>
>>> Thanks in advance
>>>
>>> Marlies
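One general remark on the PE: with an Open MPI that was built --with-sge, no
"catch rsh" setting and no startmpi/stopmpi scripts are needed, as the tight
integration starts its daemons via `qrsh -inherit` on its own. A typical PE
definition for this case looks roughly like the sketch below (only an
illustration; compare it with the actual output of `qconf -sp orte`):

    pe_name            orte
    slots              9999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min

The decisive entries are "control_slaves TRUE" (otherwise `qrsh -inherit` is
refused) and "job_is_first_task FALSE".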
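And regarding how to get more information on where it goes wrong: a short
diagnostic job can confirm whether the environment inside a job matches the
interactive one, before the real program is involved. A sketch (the PE name
and slot count are placeholders):

    #!/bin/bash
    #$ -pe orte 40
    #$ -cwd -j y
    # compare these values with an interactive login on the same node
    echo "PATH: $PATH"
    which mpirun mpiexec
    ulimit -s
    # the non-MPI sanity check that already works
    mpirun -np $NSLOTS hostname

If `which mpiexec` points to a different installation inside the job than it
does interactively, that could explain a segmentation fault right at startup.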
