We hacked up the Quartus DSE script to use SGE just like you did. Your requirements are significant, and we have run into the same issue here. Users always want their jobs to start immediately, and if you hold some horsepower in reserve to give an appearance of responsiveness, they want to know why all the nodes are not running their jobs.
Your requirements seem to be pretty tight. I'm not sure I caught all of
them. So forgive me if I am suggesting something that is a non-starter.
Just to clarify: if your cluster is full (all slots spoken for), you are
not expecting some jobs to get resubmitted in lieu of letting new jobs
into the run state, correct?
If I understand correctly, then you are planning on only allowing a
maximum number of slots per invocation.
The reason I ask is that we have wrapped DSE with yet another script
that does a couple of things for us, including running simulation
regressions.
We can flip switches to make it a DSE only (with multiple levels of
DSE), or Sim Only, or some combination. In that script we do things
like select the queue we are going to submit to.
Originally, when I didn't know better, I did the following, which
seemed to work. It's just hard to manage, and there is the possibility
of a race condition that can cause some screw-ups. I never saw the race
happen in the production environment, but I was able to induce it
if I tried hard enough.
Step 1. Create yourself a bunch of dse queues. Call them something
easily scriptable like dse01, dse02, ... dse15
Step 2. You have to limit the total number of DSE jobs in play in some
way, which could be as down and dirty as the number of slots available
on the machines, or you could create a rule like
'users * queues dse to slots = XX'.
Here is the thing I still don't know, because I moved away from this
approach, but I always wondered: if I created a main dse queue, applied
the limit rule to it, and made all the other queues subordinate, would
that in fact limit the total number of jobs from all DSE sources? In
reality, I think I just wrote it as 'users * queues {dse00, dse01,
dse02 .... dse15} to slots = 45'
Step 3. So now you have your global DSE limit. Next you need to create
an "invocation limit". Do so by making a rule for each DSE queue and
limiting it to, let's say, 10.
Now, in the wrapper script, I should have used a locking file for the
next step. If you are going to try this, look into it.
In a globally available location, create a DSE queue file that holds a
number from 0 to 15. It tells you the next available DSE queue to use.
Your new quartus DSE wrapper script should make a locking file, open
that queue file, increment the number, or roll it back to 0 if it is on
15 already, and then launch the DSE run using your subordinate queue.
It's a bit Rube Goldberg, but might achieve what you are looking for.
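For what it's worth, the lock-and-increment step could be sketched in
Python roughly like this. It is only an illustration, not what our
wrapper actually does: the function name, the ".lock" suffix, and the
two-digit "dseNN" naming are assumptions made for the example, and it
assumes advisory locks (flock) behave properly on the shared
filesystem, which is not a given on every NFS setup.

```python
import fcntl

NUM_QUEUES = 16  # the queue file holds a number 0 - 15


def next_dse_queue(counter_path, num_queues=NUM_QUEUES):
    """Return the queue name to use now and advance the shared counter.

    counter_path is a hypothetical path to the globally visible queue
    file; a sibling file with a ".lock" suffix serves as the lock.
    """
    lock_path = counter_path + ".lock"
    with open(lock_path, "w") as lock:
        # Blocks until no other wrapper holds the lock, so the
        # read-increment-write below cannot interleave with another run.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            try:
                with open(counter_path) as f:
                    current = int(f.read().strip() or 0) % num_queues
            except FileNotFoundError:
                current = 0  # first run: start at queue 0
            # Increment, rolling back to 0 after the last queue.
            with open(counter_path, "w") as f:
                f.write(f"{(current + 1) % num_queues}\n")
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return f"dse{current:02d}"
```

The wrapper would then hand the returned name to its submit command,
for example through qsub's -q option.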
Of course, if you are going to go through all that trouble, then why
not do ticketing from a wrapper script?
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
