On Tue, 9 Jun 2015 13:02:23 +0000
Dan Hyatt <[email protected]> wrote:

> 
> As my grid engine was stable and working (and that was more important
> than anything else), I left it mostly running, minimally administered,
> for a year.
> 
> When I went to add 6 more nodes to the grid, I inadvertently and
> stupidly reconfigured the grid, breaking it.
> 
> So I am trying to get the grid back to where my users expect it.
> 
> 
> I found what my problem was and a workaround...
> 
> The command was
>
>     qsub -cwd -b y metal < input.file > log.file
>
> and it was choking.  When I put "metal < input.file > log.file" into a
> script file and ran
>
>     qsub -cwd script.bash
>
> it works fine.
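
That failure mode makes sense: typed on the command line, the < and >
redirections are applied by your shell to qsub itself rather than passed
through to the job, so with -b y the metal binary never sees them.
Putting them in a script means the job's shell evaluates them on the
execute node.  A minimal sketch of such a wrapper (filenames taken from
your example):

    #!/bin/bash
    # The redirections are evaluated here, inside the job, on the
    # execute node, rather than by the shell that runs qsub.
    metal < input.file > log.file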
> 
> 
> 
> The next two issues, which I have been googling and poring over the
> documentation for...
> 


> When I send a job to a queue, if that queue is busy the scheduler sends
> the job to the next queue (defeating the purpose of having separate
> queues in my environment).  How do I set the queues to run jobs ONLY in
> the appointed queue?

You should be able to do this with qsub -q <queuename> <job script>.
If that isn't doing it then I suspect that either you are somehow making
soft queue requests or a JSV is rewriting your request.  In any case,
qstat -j on the job id should reveal what grid engine thinks the job
requires.
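
For example, to pin a job to one queue and then verify the request
(the queue name all.q is just illustrative):

    # Make a hard request for a specific queue.
    qsub -q all.q -cwd script.bash

    # Inspect what grid engine recorded for the job; an explicit queue
    # request appears as a hard_queue_list line in the output.
    qstat -j <jobid> | grep queue_list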

> 
> The execute nodes were updated, and some are not playing well in the
> sandbox.  When the grid sends a job there, it hangs, sends an error, but
> does not remove that blade from the execute node list like it did before.

It's not clear what hangs: the job or the node?
What is the error that is being sent, and how is it being sent?

> Is there an easy way to manually test the execute nodes (there are 180),
> and why is it not removing bad nodes from the available nodes as it did
> before?  Before, it would mark a node unusable, so when I listed the
> execute nodes I would see that the node was bad and it would not accept
> jobs.

It's not clear what was marking the node, or how it was marking it.

When a job dies as a result of some sort of error, grid engine tries to
figure out whether the cause is likely the node or the job.  If the node,
it puts the appropriate queue instance into an error ('E') state.  If the
job, it puts the job into an error state (Eqw, Erq or similar).  One
possibility is that your errors are of a kind that grid engine attributes
to the job rather than the node.  IIRC the exit status of the prolog can
be used to put either the job or the queue instance into an error state.
Possibly you had a prolog that detected problem nodes and has recently
gone AWOL?
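
For illustration, a skeletal prolog along those lines (the health check
itself is purely an example; the exit-code conventions are from
sge_conf(5)):

    #!/bin/bash
    # Hypothetical health-check prolog.  Per sge_conf(5): exit 0 means
    # success, 99 reschedules the job, 100 puts the job into an error
    # state, and any other non-zero status puts the queue instance into
    # an 'E' state, taking the bad node out of service.

    if ! mountpoint -q /scratch; then    # example check only
        echo "prolog: /scratch not mounted on $(hostname)" >&2
        exit 1    # queue instance -> 'E' state
    fi

    exit 0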

Possibly you had a load sensor and an associated load_threshold that put
the queues into an alarm ('a') state?  If that is the case you will need
to set them up again.
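
You can inspect both from the command line (queue name illustrative):

    # Show the load thresholds configured on a queue.
    qconf -sq all.q | grep load_thresholds

    # Show any load sensor configured globally or for a particular host.
    qconf -sconf | grep load_sensor

    # List queue instances and explain any alarm states.
    qstat -f -explain a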

 
-- 
William Hay <[email protected]>
