I am not sure. Your AR (advance reservation) snapshot3 build of SGE is quite new, and the problem may come from it. I am not familiar with this new SGE feature, so I would ask on the gridengine list about the error message coming from execd.
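
In the meantime, it may help to know exactly which iteration fails and to stop there, so the job is still in the state that produced the error. Below is a minimal bash sketch of that idea. Note that `run_iterations` is a hypothetical helper name (not part of SGE or Open MPI), and the sketch assumes mpirun returns a non-zero exit status when a daemon fails to start.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption, not from SGE or Open MPI):
# run the given command up to <max> times and stop at the first failure.
run_iterations() {
    local max=$1
    shift
    local i=0
    local status=0
    while [ "$i" -lt "$max" ]; do
        echo "Iteration: $i"
        # Run the command; remember its exit status if it fails.
        "$@" || status=$?
        if [ "$status" -ne 0 ]; then
            echo "Iteration $i failed with exit status $status" >&2
            return "$status"
        fi
        i=$((i + 1))
    done
    return 0
}

# Usage inside the job script (paths and variables as in the quoted script):
# run_iterations 100 /usr/local/openmpi-1.2.4/bin/mpirun -np "$NP" -hostfile "$TMP/machines" send
```

If the loop stops early, the 'qstat -t' command mentioned in the error output can then be run against the job before it is cleaned up.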

Neeraj Chourasia wrote:
Hello everyone,

I am facing a problem when calling mpirun in a loop under SGE. My SGE version is SGE6.1AR_snapshot3. The script I am submitting via SGE is:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
let i=0

while [ $i -lt 100 ]
do
    echo "############################################################################################"
    echo "Iteration :$i"
    /usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send
    let "i+=1"
    echo "############################################################################################"
done
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The above script runs well for 15-20 iterations and then fails with the following message:

-------------------------Error Message-------------------------------------------------------------------
error: executing task of job 3869 failed: execution daemon on host "n101" didn't accept task
[n199:11989] ERROR: A daemon on node n101 failed to start as expected.
[n199:11989] ERROR: There may be more information available from
[n199:11989] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[n199:11989] ERROR: If the problem persists, please restart the
[n199:11989] ERROR: Grid Engine PE job
[n199:11989] ERROR: The daemon exited unexpectedly with status 1.
-----------------------------------------------------------------------------------------------------------

When I ssh to n101, there is no orted or qrsh_starter running. While checking its spool file, I came across the following message:

-----------------------------------------------Execd spool Error Message---------------------------------
|execd|n101|E|no free queue for job 3869 of user neeraj@n199 (localhost = n101)
-----------------------------------------------------------------------------------------------------------------------

What could be the reason for it?
While checking the mailing list, I came across the following link
        http://www.open-mpi.org/community/lists/users/2007/03/2771.php
but I don't think it's the same problem. Any help is appreciated.

Regards
Neeraj




------------------------------------------------------------------------

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--

- Pak Lui
pak....@sun.com
