On 11.04.2015, at 14:02, Marlies Hankel wrote:

> Dear Reuti,
>
> No, I did not use ScaLAPACK for now.

Aha, I asked as I never got the ScaLAPACK version of VASP running, only the
traditional parallelization.

> We do not have Intel MPI, and at the moment I needed to get things going to
> get our new cluster up and usable.
>
> All our calculations are MPI based, not just VASP, and my own home-grown code
> does not run through SGE either, so I hope I can find the problem soon....

Does this happen to a simple mpihello application too?

-- Reuti
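A minimal test along those lines might look like the sketch below (untested;
only the "orte" PE name comes from this thread, while the 40-slot request, the
file names and the use of `mpicc` from the run-time Open MPI are assumptions).
If even this segfaults under SGE, the SGE/MPI integration rather than VASP
itself is the suspect.

    cat > mpihello.c <<'EOF'
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);
        printf("Hello from rank %d of %d on %s\n", rank, size, host);
        MPI_Finalize();
        return 0;
    }
    EOF
    mpicc -o mpihello mpihello.c   # compile with the same Open MPI used at run time

    cat > mpihello.sge <<'EOF'
    #!/bin/bash
    #$ -cwd -j y -N mpihello
    #$ -pe orte 40
    mpirun -np $NSLOTS ./mpihello
    EOF
    qsub mpihello.sge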
> Best wishes
>
> Marlies
>
> On 04/11/2015 07:40 PM, Reuti wrote:
>> On 11.04.2015, at 03:16, Marlies Hankel wrote:
>>
>>> Dear all,
>>>
>>> Yes, I checked the paths and that looked OK. Also, I made sure that it
>>> finds the right MPI version and VASP path etc.
>>>
>>> I do not think h_vmem is the problem, as I do not get any errors in the
>>> queue logs, for example. Also, in the end I changed h_vmem to be not
>>> consumable, and I also asked for a lot, and that made no difference.
>>>
>>> I will try a 1.6.5 Open MPI version and see if that makes any difference.
>>>
>>> Would the network scan cause SGE to abort the job?
>> No. But there is a delay in startup.
>>
>> BTW: Are you using ScaLAPACK for VASP?
>>
>> -- Reuti
>>
>>
>>> I do get some message about finding two IBs, but I also get that when I run
>>> interactively (ssh to the node, not via a qlogin). I have switched that off
>>> too, via MCA, to make sure this was not causing trouble.
>>>
>>> Best wishes
>>>
>>> Marlies
>>>
>>>
>>> On 04/10/2015 08:12 PM, Reuti wrote:
>>>>> On 10.04.2015, at 04:51, Marlies Hankel <[email protected]> wrote:
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I have a ROCKS 6.1.1 install and I have also installed the SGE roll, so
>>>>> the base config was done via the ROCKS install. The only changes I have
>>>>> made are setting the h_vmem complex to consumable and setting up a
>>>>> scratch complex. I have also set h_vmem for all hosts.
>>>> And does the VASP job work without h_vmem? We are using VASP too and have
>>>> no problems with any h_vmem set.
>>>>
>>>>
>>>>> I can run single-CPU jobs fine and can execute simple things like
>>>>>
>>>>> mpirun -np 40 hostname
>>>>>
>>>>> but I cannot run proper MPI programs. I get the following error:
>>>>>
>>>>> mpirun noticed that process rank 0 with PID 27465 on node phi-0-3 exited
>>>>> on signal 11 (Segmentation fault).
>>>> Are you using the correct `mpiexec` also during execution of a job, i.e.
>>>> between the nodes - maybe the interactive login has a different $PATH set
>>>> than inside a job script?
>>>>
>>>> And if it's from Open MPI: was the application compiled with the same
>>>> version of Open MPI whose `mpiexec` is used later on on all nodes?
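A quick way to answer both questions from inside a job is to log what the job
shell actually sees and compare it with the same commands run interactively on
the node. A sketch (the plain `vasp` binary name is a placeholder for the
locally installed executable):

    # Add at the top of the job script; compare the output with an
    # interactive shell on the same node.
    echo "host:            $(hostname)"
    echo "which mpirun:    $(which mpirun)"
    echo "which mpiexec:   $(which mpiexec)"
    mpirun --version 2>&1 | head -n 1   # should match the Open MPI VASP was built with
    echo "PATH:            $PATH"
    echo "LD_LIBRARY_PATH: $LD_LIBRARY_PATH"
    ulimit -a
    ldd "$(which vasp)" | grep -i mpi   # which MPI library the binary really resolves to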
>>>>> Basically the queue's error logs on the head node and the execution nodes
>>>>> show nothing (/opt/gridengine/default/spool/../messages), and the .e, .o
>>>>> and .pe, .po files also show nothing. The above error is in the standard
>>>>> output file of the program. I am trying VASP but have also tried a
>>>>> home-grown MPI code. Both of these have been running out of the box via
>>>>> SGE for years on our old cluster (which was not ROCKS). I have tried the
>>>>> supplied orte PE (programs are compiled with Open MPI 1.8.4
>>>> The easiest would be to stay with Open MPI 1.6.5 as long as possible. In
>>>> the 1.8 series they changed some things which might hinder proper use:
>>>>
>>>> - Core binding is enabled by default in Open MPI 1.8. With two MPI jobs on
>>>> a node, they may use the same cores and leave others idle. One can use
>>>> "--bind-to none" and leave the binding of SGE in effect, if any (see the
>>>> sketch below). The behavior differs in that SGE gives a job a set of cores
>>>> and the Linux scheduler is free to move the processes around inside this
>>>> set, whereas the native binding in Open MPI is per process (something SGE
>>>> can't do, of course, as Open MPI forks additional processes after the
>>>> initial startup of `orted`). (Sure, the set of cores given by SGE could be
>>>> rearranged into a list for Open MPI.)
>>>>
>>>> - Open MPI may scan the network before the actual job starts to get all
>>>> possible routes between the nodes. Depending on the network setup this may
>>>> take 1-2 minutes.
>>>>
>>>> -- Reuti
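For the first point, the corresponding job-script line would look roughly like
the sketch below (the plain `vasp` binary name is a placeholder;
--report-bindings is only there to make the effect visible in the job output):

    # Open MPI 1.8: switch off Open MPI's own core binding and keep whatever
    # binding SGE applied to the job (if any).
    mpirun --bind-to none --report-bindings -np $NSLOTS vasp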
>>>>> compiled with Intel and with --with-sge and --with-verbs) and have also
>>>>> tried one where I specify catch rsh and startmpi and stopmpi scripts, but
>>>>> it made no difference. It seems as if the program does not even start. I
>>>>> am not even trying to run over several nodes yet.
>>>>>
>>>>> Adding to that is that I can run the program (VASP) perfectly fine by
>>>>> ssh-ing to a node and just running it from the command line, and also
>>>>> over several nodes via a hostfile. So VASP itself is working fine.
>>>>>
>>>>> I had a look at env and made sure the ulimits are set OK (I need
>>>>> ulimit -s unlimited for VASP to work), but all looks OK.
>>>>>
>>>>> Has anyone seen this problem before? Or do you have any suggestion on
>>>>> what to do to get some info on where it actually goes wrong?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Marlies
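Since the stack limit is one of the few settings that routinely differs between
an interactive ssh shell and a batch job (children of sge_execd do not
typically inherit limits raised in a login shell), it may be worth setting it
explicitly in the job script. A sketch for the single-node case (the "orte" PE
name is from this thread; the slot count and the plain `vasp` binary name are
assumptions):

    #!/bin/bash
    #$ -cwd -j y
    #$ -pe orte 40
    # Raise the stack limit for this job's processes; this only works if the
    # hard limit on the execution host allows it.
    ulimit -s unlimited
    ulimit -s            # log the limit the local MPI ranks will inherit
    mpirun -np $NSLOTS vasp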
>>>>> --
>>>>>
>>>>> ------------------
>>>>>
>>>>> Dr. Marlies Hankel
>>>>> Research Fellow, Theory and Computation Group
>>>>> Australian Institute for Bioengineering and Nanotechnology (Bldg 75)
>>>>> eResearch Analyst, Research Computing Centre and Queensland Cyber
>>>>> Infrastructure Foundation
>>>>> The University of Queensland
>>>>> Qld 4072, Brisbane, Australia
>>>>> Tel: +61 7 334 63996 | Fax: +61 7 334 63992 | Mobile: 0404262445
>>>>> Email: [email protected] | www.theory-computation.uq.edu.au
>
> --
>
> ccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccms
>
> Please note change of work hours: Monday, Wednesday and Friday
>
> Dr. Marlies Hankel
> Research Fellow
> High Performance Computing, Quantum Dynamics & Nanotechnology
> Theory and Computational Molecular Sciences Group
> Room 229, Australian Institute for Bioengineering and Nanotechnology (75)
> The University of Queensland
> Qld 4072, Brisbane
> Australia
> Tel: +61 (0)7-33463996
> Fax: +61 (0)7-334 63992
> Mobile: +61 (0)404262445
> Email: [email protected]
> http://web.aibn.uq.edu.au/cbn/
>
> ccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccms
>
> Notice: If you receive this e-mail by mistake, please notify me, and do
> not make any use of its contents. I do not waive any privilege,
> confidentiality or copyright associated with it. Unless stated
> otherwise, this e-mail represents only the views of the Sender and not
> the views of The University of Queensland.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users