> On 27.01.2016 at 04:40, Marlies Hankel <[email protected]> wrote:
>
> On 01/26/2016 11:22 PM, Reuti wrote:
>>> On 25.01.2016 at 23:32, Marlies Hankel <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> Thank you. So I really have to remove the node from SGE and then put it
>>> back in? Or is there an easier way?
>>
>> There is no need to remove it from SGE. My idea was to shut down the execd,
>> remove the particular spool directory for this job, then start the execd
>> again.
>>
>>> I checked the spool directory and there is nothing in there.
>>
>> Okay, although for the running job you should see some directory structure
>> therein where the path name components form the job id of the job.
>>
> No, there was nothing. The node now has a new job running on it, and the
> spool directory only has directories relating to this new job.
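(For reference, a minimal sketch of the cleanup described above. The startup
script location and the spool path are assumptions based on a default
OGS/GE 2011.11 installation with cell "default" and per-node local spooling;
verify both before removing anything.)

    # On the affected node, as root. All paths are assumptions.
    $SGE_ROOT/default/common/sgeexecd stop     # shut down the execd
    # Under .../spool/node3/jobs the subdirectory path components encode the
    # job id; remove only the entries belonging to the stuck job.
    rm -rf $SGE_ROOT/default/spool/node3/jobs/<dirs-of-stuck-job>
    $SGE_ROOT/default/common/sgeexecd start    # start the execd again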
Fine.

>>> Also, funny enough, the node does accept jobs if they are single-node
>>> jobs. For example, if two nodes are free, node2 and node3, with 20 cores
>>> each, and a user submits a parallel job asking for 40 cores, it will not
>>> run: the scheduler says the PE only offers 20 cores, with the error about
>>> node3 that a previous job needs to be cleaned up. But if the user submits
>>> 2 jobs, each asking for 20 cores, they both run just fine.
>>>
>>> However, I found something in the spool directory of the queue master.
>>> Here node3 is the only node that has entries in the
>>> reschedule_unknown_list, while all other exechosts have NONE there. All
>>> of these job IDs belong to jobs that are either running fine or still
>>> queueing.
>>
>> Did these jobs run on node3 before and get rescheduled already - are there
>> any `qacct` entries for them?
>
> Yes, see below. It seems that the node lost contact with the storage where
> the home directories are. The user has now deleted the jobs that were still
> in the queue, and they have now disappeared from the reschedule list. I have
> now disabled node3 and will enable it again once the user's other jobs that
> are on the reschedule list have finished. Or am I able to clean this up in
> another way?

Not that I'm aware of.

-- Reuti

> Best wishes
>
> Marlies
>
> jobnumber    1864
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Sun Jan 24 22:02:45 2016
> start_time   -/-
> end_time     -/-
> granted_pe   orte
> slots        40
> failed       26 : opening input/output file
> exit_status  0
> ru_wallclock 0
> ru_utime     0.000
> ru_stime     0.000
> ru_maxrss    0
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    0
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   0
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     0
> ru_nivcsw    0
> cpu          0.000
> mem          0.000
> io           0.000
> iow          0.000
> maxvmem      0.000
> arid         undefined

>>
>> -- Reuti
>>
>>> But I do not know from where this file is populated to be able to clear
>>> this list.
>>>
>>> Best wishes
>>>
>>> Marlies
>>>
>>> [root@queue]# more /opt/gridengine/default/spool/qmaster/exec_hosts/node3
>>>
>>> # Version: 2011.11p1
>>> #
>>> # DO NOT MODIFY THIS FILE MANUALLY!
>>> #
>>> hostname                node3
>>> load_scaling            NONE
>>> complex_values          h_vmem=120G
>>> load_values             arch=linux-x64,num_proc=20,mem_total=129150.914062M,swap_total=3999.996094M,virtual_total=133150.910156M
>>> processors              20
>>> reschedule_unknown_list 1864=1=8,1866=1=8,1867=1=8,1868=1=8,1871=1=8
>>> user_lists              NONE
>>> xuser_lists             NONE
>>> projects                NONE
>>> xprojects               NONE
>>> usage_scaling           NONE
>>> report_variables        NONE
>>>
>>>
>>> On 01/26/2016 04:13 AM, Reuti wrote:
>>>> Hi,
>>>>
>>>>> On 25.01.2016 at 00:46, Marlies Hankel <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Over the weekend something seems to have gone wrong with one of the
>>>>> nodes in our cluster. We get the error:
>>>>>
>>>>> cannot run on host "cpu-1-3.local" until clean up of an previous run has
>>>>> finished
>>>>>
>>>>> I have restarted the node but the error is still there. We are using
>>>>> OGS/Grid Engine 2011.11 as implemented in ROCKS 6.1.
>>>>> How can I fix the node so jobs can run on it again?
>>>>
>>>> Sometimes there are some remains of former jobs in the exechost's spool
>>>> directory, in the directory "jobs". But as you rebooted the node already,
>>>> we don't have to take care of other jobs running on this machine, hence
>>>> it might be easier to remove the complete spool directory for this
>>>> exechost, i.e. something like $SGE_ROOT/default/spool/node17. The
>>>> directory node17 will be recreated automatically when the execd starts
>>>> on node17.
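(A rough sketch of this whole-directory variant, under the same path
assumptions as above; since the node was rebooted, no jobs are left running
on it:)

    # On the exechost, as root. Paths are assumptions; adjust the node name.
    rm -rf $SGE_ROOT/default/spool/node17      # remove the whole exechost spool
    $SGE_ROOT/default/common/sgeexecd start    # recreates spool/node17 on startup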
>>>>
>>>> -- Reuti
>>>>
>>>>> Also, does anyone know what might have caused this error to appear in
>>>>> the first place?
>>>>>
>>>>> Best wishes and thank you in advance for your help
>>>>>
>>>>> Marlies
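(For completeness, sketches of two commands that came up earlier in the
thread - checking the accounting records of a rescheduled job and
disabling/re-enabling a node. The queue name "all.q" is an assumption;
substitute your own:)

    qacct -j 1864            # accounting entries for job 1864, incl. failed runs
    qmod -d 'all.q@node3'    # disable the queue instance on node3; running jobs continue
    qmod -e 'all.q@node3'    # re-enable it once the rescheduled jobs have finished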
