On Wed, 7 Feb 2018 at 12:46am, William Hay wrote:

On Tue, Feb 06, 2018 at 12:13:24PM -0800, Joshua Baker-LePain wrote:
I'm back again -- is it obvious that my new cluster just went into
production?  Again, we're running SoGE 8.1.9 on a cluster with nodes of
several different sizes.  We're running into an odd issue where SGE stops
scheduling jobs despite available slots.  The messages file contains many
instances of messages like this:

02/06/2018 12:03:41|worker|wynq1|E|not enough (1) free slots in queue "ondemand.q@cc-hmid1" for job 142497.1
02/06/2018 12:03:41|worker|wynq1|W|Skipping remaining 12 orders

Now, the project that 142497.1 (a 500-slot MPI job) belongs to cannot run in
the named queue instance -- an RQS limits that project's usage there to 0
slots.  Also, if I run "qalter -w p" on the job, it reports "verification:
found possible assignment with 500 slots".  But the job will never get
scheduled, and neither will *any* other job.  The only way I've found to get
things flowing again is to stop and restart sgemaster.
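
For reference, the commands involved -- the init script name and path are an
assumption based on a typical SoGE install, so adjust for your setup:

   # Dry-run scheduling check on the stuck job:
   qalter -w p 142497
   #   -> verification: found possible assignment with 500 slots

   # Last-resort workaround that gets scheduling flowing again:
   /etc/init.d/sgemaster stop
   /etc/init.d/sgemaster start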

Since it's possibly (probably?) related, I should say that I have
max_reservation set to 1024 in the scheduler config.  Also, I've had
instances of this error in the past where the queue@host instance mentioned
in the error is actually defined as having 0 slots.  So it's not tied to the
RQS.
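
That is, the relevant scheduler configuration line as "qconf -ssconf"
reports it here:

   # qconf -ssconf | grep max_reservation
   max_reservation                   1024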

Can anyone give me some pointers on how to debug this?  Thanks.

IIRC resource quotas and reservations don't always play nicely together.
The same error can come about for multiple different reasons, so having
had this error in the past when the queue was defined as having 0 slots
doesn't eliminate RQS as a suspect.

I would set MONITOR=1 in the sched_conf and have a look at the schedule
file to see a little more detail about what is going on.
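
I.e. edit the "params" attribute via "qconf -msconf" -- a minimal sketch,
with the schedule file path assumed from a default cell layout:

   # qconf -msconf   (edit the "params" line)
   params                            MONITOR=1

   # each scheduling decision is then appended, one record per line, to:
   #   $SGE_ROOT/$SGE_CELL/common/schedule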

Here's what my queue of waiting jobs currently looks like:
 153758 0.51149 tomography USER1       qw    02/08/2018 14:03:05                                    192
 153759 0.00000 qss_svk_ge USER2       qw    02/08/2018 14:15:06                                      1 1
 153760 0.00000 qss_svk_ge USER2       qw    02/08/2018 14:15:06                                      1 1

with more jobs below that, all with 0.00000 priority.  Starting at 14:03:06 in the messages file, I see this:

02/08/2018 14:03:06|worker|wynq1|E|not enough (1) free slots in queue "ondemand.q@cin-id3" for job 153758.1

And in the schedule file I see this:

153758:1:STARTING:1518127386:82860:P:mpi:slots:192.000000
153758:1:STARTING:1518127386:82860:H:msg-id19:mem_free:16106127360.000000
153758:1:STARTING:1518127386:82860:Q:member.q@msg-id19:slots:15.000000
153758:1:STARTING:1518127386:82860:L:member_queue_limits:/USER1lab////:15.000000
153758:1:STARTING:1518127386:82860:H:qb3-id1:mem_free:1073741824.000000
153758:1:STARTING:1518127386:82860:Q:ondemand.q@qb3-id1:slots:1.000000
153758:1:STARTING:1518127386:82860:L:ondemand_queue_limits:USER1/////:1.000000
153758:1:STARTING:1518127386:82860:H:qb3-id1:mem_free:11811160064.000000
153758:1:STARTING:1518127386:82860:Q:long.q@qb3-id1:slots:11.000000
.
.
.
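
For anyone reading along, my understanding of the MONITOR record format is
that each colon-separated entry breaks down as:

   job:task:state:start_time:duration:level:object:resource:utilization

where level is P (parallel environment), G (global), H (host), Q (queue
instance) or L (resource quota rule), and state is STARTING, RUNNING,
RESERVING, etc.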

Now, the relevant bits of the RQSes referenced above look like this:

   limit        projects {USER1lab,OTHERlab} queues member.q to slots=315
.
   limit        users {*} queues ondemand.q to slots=0
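
For context, here's how I'd expect the ondemand rule to look as a complete
rule set from "qconf -srqs ondemand_queue_limits" -- the rule set name is
taken from the schedule file above; the description and enabled values are
assumed:

   {
      name         ondemand_queue_limits
      description  NONE
      enabled      TRUE
      limit        users {*} queues ondemand.q to slots=0
   }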

So why is it trying to give the job slots in ondemand.q?

As a slightly less drastic method than restarting the qmaster, you could
try reducing the priority (qalter -p) of the problem job for a scheduling
cycle to below that of the jobs stuck behind it, to see if they will start
even if the problem job won't.
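
Something like the following, with -1023 being the bottom of the POSIX
priority range:

   # drop the stuck job to minimum priority for a scheduling cycle:
   qalter -p -1023 153758
   # ...and restore it afterwards:
   qalter -p 0 153758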

Even "qalter -p -1023 153758" didn't get things moving.

Now, *all* the entries regarding 153758 in the schedule file say "STARTING", not "RESERVING". Does that mean that reservations aren't coming into play here at all and it's entirely an RQS issue?
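
(As I understand it, RESERVING records should only show up for jobs that
requested a reservation at submit time; one way to check that:

   # jobs submitted with "-R y" show a "reserve" line in qstat -j output
   qstat -j 153758 | grep reserve
)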

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
