> On 25.01.2016 at 23:32, Marlies Hankel <[email protected]> wrote:
>
> Hi all,
>
> Thank you. So I really have to remove the node from SGE and then put it back
> in? Or is there an easier way?

There is no need to remove it from SGE. My idea was to shut down the execd,
remove the particular spool directory for this job, and start the execd again.
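For example, something along these lines on node3 itself (a rough sketch only:
the init script name and the spool location depend on your installation, and
the paths in angle brackets are placeholders, not literal values):

   # stop the execd on the node (or run "qconf -ke node3" on the qmaster)
   /etc/init.d/sgeexecd.<cluster> stop

   # look for leftovers of the stale job; the subdirectory names under "jobs"
   # spell out the job id
   ls -R /opt/gridengine/default/spool/node3/jobs
   ls /opt/gridengine/default/spool/node3/active_jobs

   # remove only the entries belonging to that job, e.g.
   rm -rf /opt/gridengine/default/spool/node3/jobs/<path for this job id>
   rm -rf /opt/gridengine/default/spool/node3/active_jobs/<jobid>.<taskid>

   # start the execd again
   /etc/init.d/sgeexecd.<cluster> start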
> I checked the spool directory and there is nothing in there.

Okay, although for the running job you should see some directory structure
therein where the path name components form the job id of the job.

> Also, funny enough the node does accept jobs if they are single node jobs.
> For example, if there are two nodes free, node2 and node3, with 20 cores
> each, and a user submits a parallel job asking for 40 cores, it will not
> run, saying the PE only offers 20 cores, with the error about node3 that a
> previous job needs to be cleaned up. But if the user submits 2 jobs each
> asking for 20 cores, they both run just fine.
>
> However, I found something in the spool directory of the queue master. Here
> node3 is the only node that has entries in the reschedule_unknown_list,
> while all other exec hosts have NONE there. All of these job IDs belong to
> jobs that are either running fine or still queueing.

Did these jobs run on node3 before and get rescheduled already? Are there any
`qacct` entries for them?
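For example, taking the first job id from the reschedule_unknown_list you
quote below (1864 is just that first entry, substitute the others as needed):

   qacct -j 1864    # accounting records of any finished or rescheduled run
   qstat -j 1864    # whether the job is still known to the qmaster as pending/running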
-- Reuti

> But I do not know from where this file is populated to be able to clear
> this list.
>
> Best wishes
>
> Marlies
>
> [root@queue]# more /opt/gridengine/default/spool/qmaster/exec_hosts/node3
>
> # Version: 2011.11p1
> #
> # DO NOT MODIFY THIS FILE MANUALLY!
> #
> hostname                  node3
> load_scaling              NONE
> complex_values            h_vmem=120G
> load_values               arch=linuxx64,num_proc=20,mem_total=129150.914062M,swap_total=3999.996094M,virtual_total=133150.910156M
> processors                20
> reschedule_unknown_list   1864=1=8,1866=1=8,1867=1=8,1868=1=8,1871=1=8
> user_lists                NONE
> xuser_lists               NONE
> projects                  NONE
> xprojects                 NONE
> usage_scaling             NONE
> report_variables          NONE
>
>
> On 01/26/2016 04:13 AM, Reuti wrote:
>> Hi,
>>
>>> On 25.01.2016 at 00:46, Marlies Hankel <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> Over the weekend something seems to have gone wrong with one of the nodes
>>> in our cluster. We get the error:
>>>
>>> cannot run on host "cpu-1-3.local" until clean up of an previous run has
>>> finished
>>>
>>> I have restarted the node but the error is still there. We are using
>>> OGS/Grid Engine 2011.11 as implemented in ROCKS 6.1.
>>> How can I fix the node so jobs can run on it again?
>>
>> Sometimes there are some remains of former jobs in the exechost's spool
>> directory in the directory "jobs". But as you rebooted the node already, we
>> don't have to take care of other jobs running on this machine, hence it
>> might be easier to remove the complete spool directory for this exechost,
>> i.e. something like $SGE_ROOT/default/spool/node17. The directory node17
>> will be recreated automatically when the execd starts on node17.
>>
>> -- Reuti
>>
>>> Also, does anyone know what might have caused this error to appear in the
>>> first place?
>>>
>>> Best wishes and thank you in advance for your help
>>>
>>> Marlies
>>>
>>> --
>>>
>>> ------------------
>>>
>>> Dr. Marlies Hankel
>>> Research Fellow, Theory and Computation Group
>>> Australian Institute for Bioengineering and Nanotechnology (Bldg 75)
>>> eResearch Analyst, Research Computing Centre and Queensland Cyber
>>> Infrastructure Foundation
>>> The University of Queensland
>>> Qld 4072, Brisbane, Australia
>>> Tel: +61 7 334 63996 | Fax: +61 7 334 63992 | Mobile: 0404262445
>>> Email: [email protected] | www.theory-computation.uq.edu.au
>>>
>>> Notice: If you receive this e-mail by mistake, please notify me,
>>> and do not make any use of its contents. I do not waive any
>>> privilege, confidentiality or copyright associated with it. Unless
>>> stated otherwise, this e-mail represents only the views of the
>>> Sender and not the views of The University of Queensland.
>>>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>
> --
>
> ccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccms
>
> Please note change of work hours: Monday, Wednesday and Friday
>
> Dr. Marlies Hankel
> Research Fellow
> High Performance Computing, Quantum Dynamics & Nanotechnology
> Theory and Computational Molecular Sciences Group
> Room 229, Australian Institute for Bioengineering and Nanotechnology (75)
> The University of Queensland
> Qld 4072, Brisbane
> Australia
> Tel: +61 (0)7-33463996
> Fax: +61 (0)7-334 63992
> Mobile: +61 (0)404262445
> Email: [email protected]
> http://web.aibn.uq.edu.au/cbn/
>
> ccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccms
>
> Notice: If you receive this e-mail by mistake, please notify me, and do
> not make any use of its contents. I do not waive any privilege,
> confidentiality or copyright associated with it. Unless stated
> otherwise, this e-mail represents only the views of the Sender and not
> the views of The University of Queensland.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
