On 06.09.2017 at 19:22, Michael Stauffer wrote:

> On Wed, Sep 6, 2017 at 12:42 PM, Reuti <[email protected]> wrote:
>
> > On 06.09.2017 at 17:33, Michael Stauffer <[email protected]> wrote:
> >
> > > On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang <[email protected]> wrote:
> > > > It seems the SGE master did not get refreshed with the new hostgroup.
> > > > Maybe you can try:
> > > >
> > > > 1. restart SGE master
> > >
> > > Is it safe to do this with jobs queued and running? I think it's not
> > > reliable, i.e. jobs can get killed and de-queued?
> >
> > Just to mention: it's safe to restart the qmaster, or even to reboot the
> > machine the qmaster is running on. Nothing will happen to the running
> > jobs on the exechosts.
>
> OK good to know. I've done that before and seen them finish, although some
> googling suggested people have seen jobs get killed.

No. NB: They will get killed, though, if you shut down the "sgeexecd" on an
exechost with the conventional "stop" argument. Supplying the argument
"softstop" instead allows them to continue, although they are then no longer
supervised by the "sgeexecd". This can be handy when a user requested an h_rt
that is too short for the job and the job needs to be allowed to keep running.

> Does a qmaster restart, however, empty the queue?

No.

> I imagine a reboot would too, unless the queue is stored in a file?

All vital information is stored in flat files or BDB. The only thing lost is
the list of completed jobs, aka zombies (which you can see with `qstat -s z`;
how many of them are kept can be set via the "finished_jobs" entry in
`qconf -mconf`).

-- Reuti

> -M
>
> > -- Reuti
> >
> > > > or
> > > >
> > > > 2. change basic.q, "hostlist" to any node, like "compute-1-0.local",
> > > > wait till it gets refreshed; then change it back to "@basichosts".
> > >
> > > I've done this, but it's not refreshing (been about 10 minutes now). I'm
> > > still getting the error when I try to delete exec host compute-2-4, and
> > > qhost is still showing basic.q on the nodes in @basichosts.
> > >
> > > Interestingly, host compute-2-4 was removed from another queue
> > > (qlogin.basic.q) that also uses @basichosts, so it's something about
> > > basic.q that's stuck.
> > >
> > > Is there some way to refresh things other than restarting qmaster?
> > >
> > > -M
> > >
> > > > On Wed, Sep 6, 2017 at 10:29 AM, Michael Stauffer <[email protected]> wrote:
> > > > > SoGE 8.1.8
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'm having trouble deleting an execution host. I've removed it from
> > > > > the host group, but when I try to delete it with qconf, it says it's
> > > > > still part of 'basic.q'. Here's the relevant output. Anyone have any
> > > > > suggestions?
> > > > >
> > > > > [root@chead ~]# qconf -de compute-2-4.local
> > > > > Host object "compute-2-4.local" is still referenced in cluster queue
> > > > > "basic.q".
> > > > >
> > > > > [root@chead ~]# qconf -sq basic.q
> > > > > qname                 basic.q
> > > > > hostlist              @basichosts
> > > > > seq_no                0
> > > > > load_thresholds       np_load_avg=1.74
> > > > > suspend_thresholds    NONE
> > > > > nsuspend              1
> > > > > suspend_interval      00:05:00
> > > > > priority              0
> > > > > min_cpu_interval      00:05:00
> > > > > processors            UNDEFINED
> > > > > qtype                 BATCH
> > > > > ckpt_list             NONE
> > > > > pe_list               make mpich mpi orte unihost serial
> > > > > rerun                 FALSE
> > > > > slots                 8,[compute-1-2.local=3],[compute-1-0.local=7], \
> > > > >                       [compute-1-1.local=7],[compute-1-3.local=7], \
> > > > >                       [compute-1-5.local=8],[compute-1-6.local=8], \
> > > > >                       [compute-1-7.local=8],[compute-1-8.local=8], \
> > > > >                       [compute-1-9.local=8],[compute-1-10.local=8], \
> > > > >                       [compute-1-11.local=8],[compute-1-12.local=8], \
> > > > >                       [compute-1-13.local=8],[compute-1-14.local=8], \
> > > > >                       [compute-1-15.local=8]
> > > > > tmpdir                /tmp
> > > > > shell                 /bin/bash
> > > > > prolog                NONE
> > > > > epilog                NONE
> > > > > shell_start_mode      posix_compliant
> > > > > starter_method        NONE
> > > > > suspend_method        NONE
> > > > > resume_method         NONE
> > > > > terminate_method      NONE
> > > > > notify                00:00:60
> > > > > owner_list            NONE
> > > > > user_lists            NONE
> > > > > xuser_lists           NONE
> > > > > subordinate_list      NONE
> > > > > complex_values        NONE
> > > > > projects              NONE
> > > > > xprojects             NONE
> > > > > calendar              NONE
> > > > > initial_state         default
> > > > > s_rt                  INFINITY
> > > > > h_rt                  INFINITY
> > > > > s_cpu                 INFINITY
> > > > > h_cpu                 INFINITY
> > > > > s_fsize               INFINITY
> > > > > h_fsize               INFINITY
> > > > > s_data                INFINITY
> > > > > h_data                INFINITY
> > > > > s_stack               INFINITY
> > > > > h_stack               INFINITY
> > > > > s_core                INFINITY
> > > > > h_core                INFINITY
> > > > > s_rss                 INFINITY
> > > > > h_rss                 INFINITY
> > > > > s_vmem                19G
> > > > > h_vmem                19G
> > > > >
> > > > > [root@chead ~]# qconf -shgrp @basichosts
> > > > > group_name @basichosts
> > > > > hostlist compute-1-0.local compute-1-2.local compute-1-3.local \
> > > > >          compute-1-5.local compute-1-6.local compute-1-7.local \
> > > > >          compute-1-8.local compute-1-9.local compute-1-10.local \
> > > > >          compute-1-11.local compute-1-12.local compute-1-13.local \
> > > > >          compute-1-14.local compute-1-15.local compute-2-0.local \
> > > > >          compute-2-2.local compute-2-5.local compute-2-7.local \
> > > > >          compute-2-8.local compute-2-9.local compute-2-11.local \
> > > > >          compute-2-12.local compute-2-13.local compute-2-15.local \
> > > > >          compute-2-6.local
> > > > >
> > > > > Thanks
> > > > >
> > > > > -M
> > > >
> > > > --
> > > > Best,
> > > >
> > > > Feng

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
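
One way to narrow down the "still referenced in cluster queue" error above is to search the queue configuration for the host name. The following is only a minimal sketch, assuming a POSIX shell with sed and grep. `find_host_refs` is a hypothetical helper, not part of Grid Engine. It reads `qconf -sq <queue>` output on stdin, joins the backslash-continued lines, and prints every attribute that names the host literally.

```shell
# Hypothetical helper (not a Grid Engine command): print each queue attribute
# from `qconf -sq <queue>` output on stdin that mentions host "$1" literally.
find_host_refs() {
    # Join lines ending in a backslash with the following line, dropping the
    # continuation's leading whitespace, then match the host as a fixed string.
    sed -e ':a' -e '/\\$/N; s/\\\n[[:space:]]*//; ta' | grep -F -- "$1"
}

# Intended use on the qmaster host (assumes qconf is in PATH):
#   qconf -sq basic.q | find_host_refs compute-2-4.local
```

No output and a non-zero exit status mean the host is not named literally in any attribute of that queue. That appears to be the case for the basic.q listing quoted in this thread.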
