We're having a problem with submit scripts not being transferred to exec
nodes and jobs being stuck in the [t]ransitioning state.
The issue is present with SoGE 8.1.6 and 8.1.9, under CentOS 7.
We are using classic spooling. On the compute nodes, the spool directory
/var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME/
exists, is owned by user 'sge' (running the execd), is writeable, and
has space.
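
A minimal sketch of how those four conditions (exists, owner, writable, free space) could be checked in one pass. The function name check_spool_dir and the free-space threshold are illustrative assumptions, not part of SGE, and `stat -c` assumes GNU coreutils as shipped with CentOS 7:

```shell
# check_spool_dir DIR OWNER [MIN_FREE_KB]
# Hypothetical helper (not an SGE tool): reports whether DIR exists, is owned
# by OWNER, is writable by the current user, and has MIN_FREE_KB available.
check_spool_dir() {
    dir=$1
    owner=$2
    min_free_kb=${3:-1024}   # arbitrary illustrative threshold
    status=0

    if [ ! -d "$dir" ]; then
        echo "MISSING: $dir"
        return 1
    fi

    # stat -c is GNU syntax; %U prints the owning user name.
    actual_owner=$(stat -c %U "$dir")
    if [ "$actual_owner" != "$owner" ]; then
        echo "OWNER: $dir is owned by $actual_owner, expected $owner"
        status=1
    fi

    # -w tests writability for whoever runs this, so run it as the execd user.
    if [ ! -w "$dir" ]; then
        echo "NOT WRITABLE: $dir"
        status=1
    fi

    # df -P keeps each filesystem on one line; column 4 is available KB.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if [ "$free_kb" -lt "$min_free_kb" ]; then
        echo "LOW SPACE: $dir has $free_kb KB free, wanted $min_free_kb"
        status=1
    fi

    if [ "$status" -eq 0 ]; then
        echo "OK: $dir"
    fi
    return "$status"
}
```

On a compute node this would be run as the 'sge' user, for example: check_spool_dir /var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME sge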
There is successful communication between the qmaster and execd hosts:
  - qping works in both directions
  - jobs submitted as binaries (-b y) run correctly
  - directives from the master to the execd (for example, to delete jobs)
    work
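
For anyone wanting to repeat these checks, a rough sketch follows. The function name sge_comm_checks is hypothetical, the port fallbacks 6444/6445 are the conventional defaults and are an assumption about this cluster, and the queue name all.q is taken from the log in this message. By default it only prints the commands; setting RUN=1 executes them:

```shell
# sge_comm_checks QMASTER_HOST EXEC_HOST
# Hypothetical helper: prints (or, with RUN=1, runs) basic connectivity checks.
sge_comm_checks() {
    qmaster_host=$1
    exec_host=$2
    # Fall back to the conventional default ports if the variables are unset.
    qmaster_port=${SGE_QMASTER_PORT:-6444}
    execd_port=${SGE_EXECD_PORT:-6445}

    # 1. exec node -> qmaster (run this one from the exec node)
    # 2. qmaster -> execd (run this one from the qmaster host)
    # 3. a binary job pinned to the exec node, which needs no script transfer
    for cmd in \
        "qping -info $qmaster_host $qmaster_port qmaster 1" \
        "qping -info $exec_host $execd_port execd 1" \
        "qsub -b y -q all.q@$exec_host /bin/hostname"
    do
        if [ "${RUN:-0}" = 1 ]; then
            $cmd    # deliberately unquoted so the string splits into arguments
        else
            echo "$cmd"
        fi
    done
}
```

Printing first makes it easy to confirm the host names and ports before anything touches the cluster.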
If I read the qmaster debug logs correctly, it looks like the qmaster isn't
able to send the submit script to the compute node:
1 worker001 debiting 8589934592.000000 of h_vmem on host 2115fmn001.foobar.local for 1 slots
2 worker001 debiting 4000000000.000000 of tmpfree on host 2115fmn001.foobar.local for 1 slots
3 worker001 debiting 1.000000 of jobs on queue all.q for 1 slots
4 worker001 debiting 1.000000 of slots on queue all.q for 1 slots
5 worker001 user doesn't match
6 worker001 user doesn't match
7 worker001 queue doesn't match
8 worker001 queue doesn't match
9 worker001 user doesn't match
10 worker001 user doesn't match
11 worker001 spooling job 9899430.1 <null>
12 worker001 Making dir "jobs/00/0989/9430/1-4096/1"
13 worker001 retval = 0
14 worker001 spooling job 9899430.1 <null>
15 worker001 Making dir "jobs/00/0989/9430"
16 worker001 retval = 0
17 worker001 TRIGGER JOB RESEND 9899430/1 in 300 seconds
18 worker001 successfully handed off job "9899430" to queue "[email protected]"
19 worker001 NO TICKET DELIVERY
We don't see corresponding log messages on the client.
What mechanism is used by SGE to transfer submit scripts (something
specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?
What are the system-level requirements for successfully sending the
submit scripts (for example: same UID for sge across the cluster, same
UID<->username mapping for the submitting user across the cluster, etc.)?
Any troubleshooting suggestions?
Thanks,
Mark