> On 27.01.2016 at 04:40, Marlies Hankel <[email protected]> wrote:
> 
> On 01/26/2016 11:22 PM, Reuti wrote:
>>> On 25.01.2016 at 23:32, Marlies Hankel <[email protected]> wrote:
>>> 
>>> Hi all,
>>> 
>>> Thank you. So do I really have to remove the node from SGE and then put it 
>>> back in? Or is there an easier way?
>> There is no need to remove it from SGE. My idea was to shut down the execd, 
>> remove the spool directory for this particular job, and then start the execd 
>> again.
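>> 
>> Roughly like this, on the node itself (only a sketch - the execd init script 
>> name is an assumption and will depend on your installation, and the job's 
>> directory under "jobs" is the one whose path components form the job id, 
>> e.g. something like 00/0018/64 for job 1864):
>> 
>>    /etc/init.d/sgeexecd stop
>>    rm -rf $SGE_ROOT/default/spool/node3/jobs/00/0018/64
>>    /etc/init.d/sgeexecd start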
>> 
>> 
>>> I checked the spool directory and there is nothing in there.
>> Okay, although for the running job you should see some directory structure 
>> in there whose path name components form the job id of the job.
>> 
> No, there was nothing. The node now has a new job running on it, and the 
> spool directory only has directories relating to this new job.

Fine.


>>> Also, funny enough, the node does accept jobs as long as they are single-node 
>>> jobs. For example, with two nodes free, node2 and node3, with 20 cores each: 
>>> if a user submits a parallel job asking for 40 cores it will not run, the 
>>> scheduler saying that the PE only offers 20 cores and showing the error about 
>>> node3 that a previous job needs to be cleaned up. But if the user submits two 
>>> jobs each asking for 20 cores, they both run just fine.
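>>> 
>>> For reference, the submissions look roughly like this (job.sh only stands in 
>>> for the actual script; orte is our parallel environment):
>>> 
>>>    qsub -pe orte 40 job.sh    # stays queued with the clean-up error
>>>    qsub -pe orte 20 job.sh    # two of these run just fine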
>>> 
>>> However, I found something in the spool directory of the queue master. Here 
>>> node3 is the only node that has entries in the reschedule_unknown_list while 
>>> all other exechosts have NONE there. All of these job IDs belong to jobs 
>>> that are either running fine or still queueing.
>> Did these jobs run on node3 before and get rescheduled already - are there 
>> any `qacct` entries for them?
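>> 
>> Something along the lines of
>> 
>>    qacct -j 1864
>> 
>> for each of the listed job ids should show whether an accounting record was 
>> written for an earlier run.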
> Yes, see below. It seems that the node lost contact with the storage where the 
> home directories are. The user has now deleted the jobs that were still in the 
> queue and they have now disappeared from the reschedule list. I have now 
> disabled node3 and will enable it again once the user's other jobs that are on 
> the reschedule list have finished. Or am I able to clean this up in another 
> way?

Not that I'm aware of.
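
Sticking with what you already did looks like the way to go, i.e. roughly

   qmod -d all.q@node3     # all.q being whatever your queue is called
   qmod -e all.q@node3     # once the jobs on the reschedule list have finished

with nothing to clean up by hand.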

-- Reuti


> Best wishes
> 
> Marlies
> 
> jobnumber    1864
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Sun Jan 24 22:02:45 2016
> start_time   -/-
> end_time     -/-
> granted_pe   orte
> slots        40
> failed       26  : opening input/output file
> exit_status  0
> ru_wallclock 0
> ru_utime     0.000
> ru_stime     0.000
> ru_maxrss    0
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    0
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   0
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     0
> ru_nivcsw    0
> cpu          0.000
> mem          0.000
> io           0.000
> iow          0.000
> maxvmem      0.000
> arid         undefined
> 
>> 
>> -- Reuti
>> 
>> 
>>> But I do not know where this file is populated from, so I am not sure how 
>>> to clear this list.
>>> 
>>> Best wishes
>>> 
>>> Marlies
>>> 
>>> [root@queue]# more /opt/gridengine/default/spool/qmaster/exec_hosts/node3
>>> 
>>> # Version: 2011.11p1
>>> #
>>> # DO NOT MODIFY THIS FILE MANUALLY!
>>> #
>>> hostname              node3
>>> load_scaling          NONE
>>> complex_values        h_vmem=120G
>>> load_values           
>>> arch=linuxx64,num_proc=20,mem_total=129150.914062M,swap_total=3999.996094M,virtual_total=133150.910156M
>>> processors            20
>>> reschedule_unknown_list 1864=1=8,1866=1=8,1867=1=8,1868=1=8,1871=1=8
>>> user_lists            NONE
>>> xuser_lists           NONE
>>> projects              NONE
>>> xprojects             NONE
>>> usage_scaling         NONE
>>> report_variables      NONE
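>>> 
>>> For what it's worth, the same exec host object can also be dumped without 
>>> touching the spool file, e.g. with
>>> 
>>>    qconf -se node3
>>> 
>>> though I am not sure whether reschedule_unknown_list shows up in that output.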
>>> 
>>> 
>>> On 01/26/2016 04:13 AM, Reuti wrote:
>>>> Hi,
>>>> 
>>>>> On 25.01.2016 at 00:46, Marlies Hankel <[email protected]> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Over the weekend something seems to have gone wrong with one of the nodes 
>>>>> in our cluster. We get the error:
>>>>> 
>>>>> cannot run on host "cpu-1-3.local" until clean up of an previous run has 
>>>>> finished
>>>>> 
>>>>> 
>>>>> I have restarted the node but the error is still there. We are using 
>>>>> OGS/Grid Engine 2011.11 as implemented in ROCKS 6.1.
>>>>> How can I fix the node so jobs can run on it again?
>>>> Sometimes there are some remains of former jobs in the exechost's spool 
>>>> directory, in the directory "jobs". But as you have rebooted the node already 
>>>> we don't have to take care of other jobs running on this machine, hence it 
>>>> might be easier to remove the complete spool directory for this exechost, 
>>>> i.e. something like $SGE_ROOT/default/spool/node17. The directory node17 
>>>> will be recreated automatically when the execd starts on node17.
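>>>> 
>>>> Roughly (a sketch only; the execd init script name is an assumption and 
>>>> depends on the installation, and no job you care about should be left on 
>>>> the node):
>>>> 
>>>>    /etc/init.d/sgeexecd stop
>>>>    rm -rf $SGE_ROOT/default/spool/node17
>>>>    /etc/init.d/sgeexecd start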
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> Also, does anyone know what might have caused this error to appear in the 
>>>>> first place?
>>>>> 
>>>>> Best wishes and thank you in advance for your help
>>>>> 
>>>>> Marlies
>>>>> 
>>>>> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
