Hi Charles,
Thank you for your advice. I modified the pbs.pm file, which fixed the
problem.
However, I have run into another problem. In the job description file I
specify how many nodes I want to request, e.g. 8:
<count>8</count>
but I only get half of that number.
All the compute nodes have 2 CPUs.
Right now (as I mentioned before) the variable $cpu_per_node in pbs.pm
is set to 1. If I change it to 2, I get 1/4 of the requested number of
nodes (e.g. 8/4 = 2). For now I have a working solution, but it is not
perfect, because I can never get more than half of the total number of
nodes in the cluster.
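
One way to see what the scheduler actually allocated is to compare the number of slots with the number of distinct hosts in the nodefile. The sketch below assumes the nodefile has one line per allocated slot (so a 2-CPU node may appear twice); summarize_nodefile and the demo host names are hypothetical, not part of pbs.pm.

```shell
#!/bin/sh
# Summarize a PBS nodefile: total slots vs. distinct hosts.
# Assumption: one line per allocated slot, so a host with 2 CPUs
# allocated appears twice.
summarize_nodefile() {
    nodefile=$1
    slots=$(grep -c . "$nodefile")            # non-empty lines
    hosts=$(sort -u "$nodefile" | grep -c .)  # distinct host names
    echo "slots=$slots hosts=$hosts"
}

# Demo with a hypothetical nodefile: 2 hosts, 2 slots each.
demo_nodefile=$(mktemp)
printf '%s\n' node10.local node10.local node11.local node11.local > "$demo_nodefile"
summarize_nodefile "$demo_nodefile"
rm -f "$demo_nodefile"
```

Inside a real job, calling summarize_nodefile "$PBS_NODEFILE" would show whether the <count> value ended up as slots or as hosts.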
Regards,
--
Daniel
Charles Bacon wrote:
On Nov 19, 2007, at 4:05 PM, Daniel Andrzejewski wrote:
I'd like to add that the name of the compute node is node10.local not
node10.local:1 as you can see in the error message.
So, possibly the PBS nodefile is coming out in a different format than
expected, thus causing trouble in the following loop:
    hosts=\`cat \$PBS_NODEFILE\`;
    counter=0
    while test \$counter -lt $count; do
        for host in \$hosts; do
            if test \$counter -lt $count; then
                $remote_shell \$host "/bin/sh $cmd_script_name" < $stdin &
                counter=\`expr \$counter + 1\`
            else
                break
            fi
        done
    done
That winds up in the submit file to PBS. You can add a line like:
system("cp $pbs_job_script_name /tmp/ws.gram.job");
right before the line reading:
chomp($job_id = `$qsub < $pbs_job_script_name $errfile`);
Then you can edit the /tmp/ws.gram.job file to see what fix is required.
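
As a rough illustration of what a fix might look like, here is a standalone version of the loop above that cycles through the nodefile entries until $count tasks are assigned, stripping a trailing ":<digits>" from each entry first. The suffix format is only an assumption based on the "node10.local:1" in the error message; strip_suffix and assign_hosts are hypothetical names, and the sketch prints the host names instead of calling the remote shell.

```shell
#!/bin/sh
# Strip a trailing ":<digits>" from a nodefile entry.
# Assumption: the suffix looks like the ":1" seen in the ssh error.
strip_suffix() {
    echo "$1" | sed 's/:[0-9][0-9]*$//'
}

# Print one host name per task, cycling through the nodefile
# entries until $count tasks have been assigned.
assign_hosts() {
    nodefile=$1
    count=$2
    counter=0
    hosts=$(cat "$nodefile")
    [ -n "$hosts" ] || return 1   # avoid looping forever on an empty file
    while [ "$counter" -lt "$count" ]; do
        for host in $hosts; do
            if [ "$counter" -lt "$count" ]; then
                strip_suffix "$host"
                counter=$((counter + 1))
            else
                break
            fi
        done
    done
}

# Demo with a hypothetical nodefile whose entries carry a ":1" suffix.
demo_nodefile=$(mktemp)
printf '%s\n' node10.local:1 node11.local:1 > "$demo_nodefile"
assign_hosts "$demo_nodefile" 3
rm -f "$demo_nodefile"
```

In the real submit script the strip_suffix call would be replaced by the $remote_shell invocation on the cleaned host name.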
Charles
The following is the relevant piece of the
${GLOBUS_LOCATION}/lib/perl/Globus/GRAM/JobManager/pbs.pm file:
----------------------
my ($mpirun, $mpiexec, $qsub, $qstat, $qdel, $cluster, $cpu_per_node,
    $remote_shell);
BEGIN
{
    $mpiexec = 'no';
    $mpirun = '/usr/local/bin/mpirun';
    $qsub = '/usr/local/bin/qsub';
    $qstat = '/usr/local/bin/qstat';
    $qdel = '/usr/local/bin/qdel';
    $cluster = 1;
    $cpu_per_node = 1;
    $remote_shell = '/usr/local/bin/ssh';
    $softenv_dir = '';
    $soft_msc = "$softenv_dir/bin/soft-msc";
    $softenv_load = "$softenv_dir/etc/softenv-load.sh";
}
----------------------
If I change $cluster to 0, I don't get any errors, but I also don't get
as many resources as I request in the job description file (e.g.
<count>10</count>).
Thank you,
--Daniel
Charles Bacon wrote:
Your client sends its hostname to the container. Are you submitting
from a machine named node10.local? If so, you should set
GLOBUS_HOSTNAME to the publicly visible name of your machine instead.
If myhost.com is really node10.local, then you should set
GLOBUS_HOSTNAME in its environment to its publicly visible name.
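
For example, in a Bourne-style shell this could be done before submitting. This is a minimal sketch; myhost.com is just the placeholder name used in this thread and stands in for whatever name the machine is actually reachable under.

```shell
# Assumption: myhost.com is the publicly resolvable name of this machine.
GLOBUS_HOSTNAME=myhost.com
export GLOBUS_HOSTNAME
echo "GLOBUS_HOSTNAME=$GLOBUS_HOSTNAME"
```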
Charles
On Nov 19, 2007, at 2:57 PM, Daniel Andrzejewski wrote:
Hi all,
When I submit the following job I get no errors, but no output
either.
globusrun-ws -submit \
    -F https://myhost.com:8443/wsrf/services/ManagedJobFactoryService \
    -Ft PBS -c /bin/ls
When I add the -s option I get the following error:
ssh: node10.local:1: Name or service not known
/var/torque/mom_priv/jobs/179.myhost-head.SC: line 37: [: too many arguments
I don't have any problems with ssh keys and I use Torque/Maui.
Thanks in advance.
--Daniel Andrzejewski
student IT Administrator
Electrical Engineering and Computer Science
University of Tennessee
(865) 974 - 4388 (work)
"Investment in knowledge always pays the best interest" Benjamin
Franklin
--