On 01/26/2016 11:22 PM, Reuti wrote:
Am 25.01.2016 um 23:32 schrieb Marlies Hankel <[email protected]>:

Hi all,

Thank you. So I really have to remove the node from SGE and then put it back 
in? Or is there an easier way?
There is no need to remove it from SGE. My idea was to shut down the execd, 
remove the particular spool directory for this job, and start the execd again.
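Run on the node itself, Reuti's three steps might look something like this (a sketch only: the init script name and spool path assume a default ROCKS/OGS install under /opt/gridengine, and the per-job directory name is a placeholder, so the commands are echoed for review rather than executed):

```shell
# Sketch of the clean-up Reuti describes; review the echoed commands and run
# them by hand once the paths match your installation.
SGE_ROOT=/opt/gridengine      # assumed default install location
HOST=node3                    # the affected exec host
echo "/etc/init.d/sgeexecd stop"                            # 1. shut down the execd on the node
echo "rm -rf $SGE_ROOT/default/spool/$HOST/jobs/<job-dir>"  # 2. remove the stale per-job spool entry (placeholder path)
echo "/etc/init.d/sgeexecd start"                           # 3. start the execd again
```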


I checked the spool directory and there is nothing in there.
Okay, although for the running job you should see some directory structure 
in there, whose path name components form the job id of the job.

No, there was nothing. The node now has a new job running on it, and the spool directory only has directories relating to this new job.
Also, funnily enough, the node does accept jobs as long as they are single-node 
jobs. For example, with two nodes free, node2 and node3, with 20 cores each: if a 
user submits a parallel job asking for 40 cores it will not run, saying the PE 
only offers 20 cores, together with the error about node3 that a previous job needs 
to be cleaned up. But if the user submits two jobs, each asking for 20 cores, they 
both run just fine.

However, I found something in the spool directory of the queue master. Here 
node3 is the only node that has entries in the reschedule_unknown_list, while all 
other exec hosts have NONE there. All of these job IDs belong to jobs that are 
either running fine or still queueing.
Did these jobs run on node3 before and get rescheduled already - are there any 
`qacct` entries for them?
Yes, see below. It seems that the node lost contact to the storage where the home directories are. The user has now deleted the jobs that were still in the queue, and they have now disappeared from the reschedule list. I have now disabled node3 and will enable it again once the user's other jobs that are on the reschedule list have finished. Or am I able to clean this up in another way?
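Taking the host out of scheduling and later re-enabling it can be done with `qmod` on its queue instances. A sketch (the wildcard queue-instance syntax is assumed to match your queue setup; commands are echoed for review since they need a live qmaster):

```shell
# Sketch: take node3 out of scheduling without removing it from SGE,
# then re-enable it once the rescheduled jobs have drained.
HOST=node3
echo "qmod -d \"*@$HOST\""   # disable every queue instance on the host
echo "qmod -e \"*@$HOST\""   # re-enable them later
```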

Best wishes

Marlies

jobnumber    1864
taskid       undefined
account      sge
priority     0
qsub_time    Sun Jan 24 22:02:45 2016
start_time   -/-
end_time     -/-
granted_pe   orte
slots        40
failed       26  : opening input/output file
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined


-- Reuti


But I do not know from where this file is populated to be able to clear this 
list.

Best wishes

Marlies

[root@queue]# more /opt/gridengine/default/spool/qmaster/exec_hosts/node3

# Version: 2011.11p1
#
# DO NOT MODIFY THIS FILE MANUALLY!
#
hostname              node3
load_scaling          NONE
complex_values        h_vmem=120G
load_values           arch=linuxx64,num_proc=20,mem_total=129150.914062M,swap_total=3999.996094M,virtual_total=133150.910156M
processors            20
reschedule_unknown_list 1864=1=8,1866=1=8,1867=1=8,1868=1=8,1871=1=8
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE


On 01/26/2016 04:13 AM, Reuti wrote:
Hi,

Am 25.01.2016 um 00:46 schrieb Marlies Hankel<[email protected]>:

Hi all,

Over the weekend something seems to have gone wrong with one of the nodes in 
our cluster. We get the error:

cannot run on host "cpu-1-3.local" until clean up of an previous run has 
finished


I have restarted the node but the error is still there. We are using OGS/Grid 
Engine 2011.11 as implemented in ROCKS 6.1.
How can I fix the node so jobs can run on it again?
Sometimes there are some remains of former jobs in the exechost's spool directory, in the 
directory "jobs". But as you have rebooted the node already, we don't have to take 
care of other jobs running on this machine; hence it might be easier to remove the 
complete spool directory for this exechost, i.e. something like 
$SGE_ROOT/default/spool/node17. The directory node17 will be recreated automatically when 
the execd starts on node17.
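Removing the whole exechost spool directory while the execd is stopped might be sketched like this (paths follow Reuti's node17 example and assume a default install under /opt/gridengine; commands are echoed for review rather than executed):

```shell
# Sketch: wipe the exechost spool directory with the execd stopped;
# the execd recreates the directory when it starts again.
SGE_ROOT=/opt/gridengine     # assumed default install location
HOST=node17                  # Reuti's example host name
echo "/etc/init.d/sgeexecd stop"
echo "rm -rf $SGE_ROOT/default/spool/$HOST"
echo "/etc/init.d/sgeexecd start"
```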

-- Reuti


Also, does anyone know what might have caused this error to appear in the first 
place?

Best wishes and thank you in advance for your help

Marlies


--

------------------

Dr. Marlies Hankel
Research Fellow, Theory and Computation Group
Australian Institute for Bioengineering and Nanotechnology (Bldg 75)
eResearch Analyst, Research Computing Centre and Queensland Cyber 
Infrastructure Foundation
The University of Queensland
Qld 4072, Brisbane, Australia
Tel: +61 7 334 63996 | Fax: +61 7 334 63992 | mobile:0404262445
Email: [email protected] | www.theory-computation.uq.edu.au


Notice: If you receive this e-mail by mistake, please notify me,
and do not make any use of its contents. I do not waive any
privilege, confidentiality or copyright associated with it. Unless
stated otherwise, this e-mail represents only the views of the
Sender and not the views of The University of Queensland.


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users