Restarting one of the execd nodes resolved the issue. I then found some leftover jobs, force-deleted them, and the problem seems to have gone away.
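
In case it helps anyone hitting the same thing, here is a minimal sketch of how the affected job IDs could be pulled out of the qmaster messages and turned into `qdel -f <job_id>` calls. It only prints the commands and does not run them. The helper names (`failed_job_ids`, `qdel_commands`) are hypothetical, and the regular expression assumes the "job N.T failed on host" wording seen in the log quoted in this thread; other SGE versions may word it differently.

```python
import re

# Assumed pattern, based on the warning line quoted in this thread:
# 03/04/2016 07:30:23|worker|sgemaster|W|job 9179498.1 failed on host ...
FAILED_JOB = re.compile(r"\|job (\d+)\.(\d+) failed on host")


def failed_job_ids(log_lines):
    """Return unique job IDs, in first-seen order, from 'job N.T failed' lines."""
    seen = []
    for line in log_lines:
        match = FAILED_JOB.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def qdel_commands(job_ids):
    """Build one 'qdel -f <job_id>' argument list per job ID (not executed)."""
    return [["qdel", "-f", job_id] for job_id in job_ids]


# Sample input: two lines taken from the log excerpt in this thread.
sample = [
    '03/04/2016 07:30:23|worker|sgemaster|E|writing job finish information: '
    'can\'t locate queue "(null)@(null)"',
    "03/04/2016 07:30:23|worker|sgemaster|W|job 9179498.1 failed on host "
    "<unknown host> before writing exit_status because: shepherd exited "
    "with exit status 19: before writing exit_status",
]

for command in qdel_commands(failed_job_ids(sample)):
    print(" ".join(command))
```

Printing rather than executing is deliberate: it leaves room to check each job with `qstat` before force-deleting anything.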
Thanks.

Simon

On Sun, Mar 6, 2016 at 10:07 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> On 04.03.2016 at 16:40, Simon Matthews wrote:
>
>> I am getting this error message:
>> 03/04/2016 07:30:14|listen|sgemaster|E|commlib error: local host name
>> error (remote rdata host name "turquoise" is not equal to local
>> resolved host name "h2.sj.bps")
>> 03/04/2016
>> 07:30:23|worker|sgemaster|E|cqueue_list_locate_qinstance("(null)@(null)"):
>> cqueue == NULL("(null)", "(null)", 1, 0
>> 03/04/2016 07:30:23|worker|sgemaster|E|writing job finish information:
>> can't locate queue "(null)@(null)"
>> 03/04/2016 07:30:23|worker|sgemaster|W|job 9179498.1 failed on host
>> <unknown host> before writing exit_status because: shepherd exited
>> with exit status 19: before writing exit_status
>> 03/04/2016 07:30:23|worker|sgemaster|C|!!!!!!!!!! got NULL element for
>> QU_rerun !!!!!!!!!!
>>
>> I have seen references to this condition being fixed by deleting the
>> job, but how do I do this? We use BDB spooling. This grid is running
>> SGE 6.2U5.
>
> Is the job still running? It looks like it finished already. Nevertheless:
> did you try a `qdel -f <job_id>`?
>
> -- Reuti
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users