My grid engine was stable and working, and that was more important than anything else, so I left it running with minimal administration for a year.

When I went to add 6 more nodes to the grid, I inadvertently (and stupidly) reconfigured the grid and broke it.

So I am trying to get the grid back to where my users expect it.


I found what my problem was, and a workaround.

The command was

qsub -cwd -b y metal < input.file > log.file

and it was choking.

When I put   metal < input.file > log.file   into a script file and ran

qsub -cwd script.bash

it worked fine.
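
For anyone hitting the same thing, here is a minimal sketch of the wrapper-script approach. The redirections on a qsub command line are handled by the submitting shell, so they apply to qsub itself rather than to the job; inside a script they run as part of the job. The names script.bash, input.file and log.file are the placeholders used above, and the embedded "#$ -cwd" line is my assumption of an equivalent to passing -cwd on the command line.

```shell
# Write a wrapper script so that the redirections are executed inside the job,
# on the execute node, instead of being applied to qsub by the submitting shell.
cat > script.bash <<'EOF'
#!/bin/bash
#$ -cwd
# input.file and log.file are placeholder names; substitute the real files.
metal < input.file > log.file
EOF

# Submit the wrapper (commented out so the snippet can be tried without a grid):
# qsub script.bash
```

With the options embedded in the script, "qsub script.bash" and "qsub -cwd script.bash" should behave the same way.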



The next two issues are ones I have been googling and poring over the documentation for:

When I send a job to a queue and that queue is busy, the job is sent to the next queue, which defeats the purpose of having separate queues in my environment. How do I set the queues so that jobs run ONLY in the appointed queue?

The execute nodes were updated, and some are not playing well in the sandbox. When the grid sends a job to one of them, the job hangs and reports an error, but the blade is not removed from the execute node list as it was before. Is there an easy way to manually test the execute nodes (there are 180)? And why is the grid no longer removing bad nodes from the available nodes? Previously it would mark such a node unusable, so when I listed the execute nodes I could see that the node was bad, and it would not accept jobs.


On 06/08/2015 02:10 PM, Alex Chekholko wrote:
What was the "grid reconfiguration"?

On 06/08/2015 11:42 AM, Dan Hyatt wrote:

We are running a binary program called metaanalysis, which the user says
was working prior to a grid reconfiguration.


qsub -cwd -b y /dsg_cent/bin/metal < c22srcfile.txt > c22SBP.log

This starts, runs, and creates the logs, but then fails to create the data
files.
qsub -cwd -b y  /dsg_cent/bin/metal < c22srcfile.txt > c22SBP.log

-rw-rw-r-- 1 aldi   genetics 8523209 Jun  8 09:53 c22GENOA.SBP.EA.M1.csv
-rw-rw-r-- 1 aldi   genetics 8660667 Jun  8 09:53 c22FamHS.SBP.ea.M1.csv
-rw-rw-r-- 1 aldi   genetics 6025412 Jun  8 09:53 c22HYPERGEN.SBP.EA.M1.csv
-rw-rw-r-- 1 aldi   genetics    2061 Jun  8 09:53 c22srcfile.txt
-rw-rw-r-- 1 dhyatt genetics      43 Jun  8 13:40 c22SBP.log
-rw-r--r-- 1 dhyatt genetics       0 Jun  8 13:40 metal.e1043
-rw-r--r-- 1 dhyatt genetics    2743 Jun  8 13:40 metal.o1043
[dhyatt@blade5-2-1 c22

The control/output files indicate that everything runs: there are .o and .e
files, but no data files.


Run directly on the command line, the same command works fine and creates
the data files, but I need to run large jobs on the queue.

-rw-rw-r-- 1 aldi   genetics  8523209 Jun  8 09:53 c22GENOA.SBP.EA.M1.csv
-rw-rw-r-- 1 aldi   genetics  8660667 Jun  8 09:53 c22FamHS.SBP.ea.M1.csv
-rw-rw-r-- 1 aldi   genetics  6025412 Jun  8 09:53 c22HYPERGEN.SBP.EA.M1.csv
-rw-rw-r-- 1 aldi   genetics     2061 Jun  8 09:53 c22srcfile.txt
-rw-rw-r-- 1 dhyatt genetics  8177082 Jun  8 13:39 METAANALYSIS1.TBL
-rw-rw-r-- 1 dhyatt genetics     1054 Jun  8 13:39 METAANALYSIS1.TBL.info
-rw-rw-r-- 1 dhyatt genetics 10487038 Jun  8 13:39 METAANALYSIS2.TBL
-rw-rw-r-- 1 dhyatt genetics     1316 Jun  8 13:39 METAANALYSIS2.TBL.info
-rw-rw-r-- 1 dhyatt genetics     5030 Jun  8 13:39 c22SBP.log

Any thoughts?

Dan
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

