Hi Chris,

I suggest that you configure SlurmctldDebug to "verbose"; it will probably tell you why the decision is being made.
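For reference, a minimal sketch of that change (SlurmctldDebug and scontrol reconfigure are standard Slurm; the log path below is site-specific and just an example):

```shell
# In slurm.conf on the controller (config fragment, not a script):
#   SlurmctldDebug=verbose
#   SlurmctldLogFile=/var/log/slurm/slurmctld.log   # path varies by site

# Pick up the change without restarting the daemon:
scontrol reconfigure

# Watch the controller log while the user resubmits:
tail -f /var/log/slurm/slurmctld.log
```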

(BTW, is it possible that the user doesn't have an account on the compute nodes yet?)
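One quick way to check that: the account must resolve to the same uid/gid on the head node and on every compute node ("newuser" below is a placeholder for the affected account):

```shell
# Run on the head node AND on a compute node; the output must match.
getent passwd newuser   # resolves via files/LDAP/NIS, per nsswitch.conf
id newuser              # uid, gid, and supplementary groups

# Or check all nodes at once if you have pdsh-style access:
# pdsh -w rhinonode[01-24] id newuser
```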

Andy

On 04/03/2017 03:30 PM, Chris Woelkers - NOAA Affiliate wrote:
I am running a small HPC cluster, only 24 nodes, via Slurm and am
having an issue where one of the users is unable to submit any jobs.
The user is new, and whenever a job is submitted it shows the "job
requeued in held state" message and never actually runs. We have left
the job sitting for over three days and it does not start. We have
tried releasing the job and it does not start. Here are the log
entries after an attempted release:

[2017-04-03T19:16:24.173] sched: update_job: releasing hold for job_id 1938 uid 0
[2017-04-03T19:16:24.174] _slurm_rpc_update_job complete JobId=1938 uid=0 usec=375
[2017-04-03T19:16:24.919] sched: Allocate JobId=1938 NodeList=rhinonode[07-14] #CPUs=192
[2017-04-03T19:16:25.017] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2017-04-03T19:16:25.035] Requeuing JobID=1938 State=0x0 NodeCnt=0

The user has the same permissions as the older users, who can run jobs.
The script being run is a simple test script, and no matter where the
output is redirected, an NFS mount (for our SAN), the local home
directory, or the tmp directory, the result is the same.

Any idea as to what might be happening?

Thanks,

Chris Woelkers
Caelum Research Corp.
Linux Server and Network Administrator
NOAA GLERL

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
