This is actually more of a logging question. I don't expect anyone to solve the
problem by remote control, but I'm having trouble figuring out which node
(server or client) the error is coming from.

Here's the scenario: one node runs globus, ws-gram, pbs_server and pbs_sched,
and the other runs pbs_mom. We're using the globus simple ca. The
job-submitting user is "labkey" on the globus node, and there is also a labkey
user on the client node.

Using ssldump and the simpleca private key, I can watch the decrypted SSL
traffic on the client node and see the job script being handed to the
pbs_mom node.

Passwordless ssh/scp is configured between the two nodes.
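"Host key verification failed." (the first error line below) is the message ssh prints when it refuses to accept the remote host's key, for example when the key is missing from known_hosts and ssh cannot ask about it. So one way to narrow things down is to test non-interactive ssh in each direction, as labkey, on both nodes. A minimal bash sketch; `check_ssh` is a hypothetical helper, and the host names you pass it should be whatever names the nodes actually use for each other.

```bash
# Hypothetical helper: check that non-interactive ssh works from this node
# to each host given, as the current user.
# BatchMode=yes stops ssh from prompting, so a missing or changed host key
# fails straight away with "Host key verification failed." instead of
# waiting for an answer that a batch job can never give.
check_ssh() {
    local host err rc=0
    for host in "$@"; do
        if err=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>&1); then
            echo "OK   $(uname -n) -> $host"
        else
            echo "FAIL $(uname -n) -> $host: $err"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `check_ssh othernode` run as labkey on each side (`othernode` being a placeholder). As far as I know, known_hosts entries are matched against the name as typed, so a short name and a fully qualified name such as domu-12-31-38-00-b4-b5.compute-1.internal may each need their own entry.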

The job-submitting user's .globus directory is shared with the mom node via
NFS. UIDs agree on both nodes, and the globus user can write to the directory.
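touch reports "No such file or directory" when a directory in the path does not exist, and "Permission denied" when it exists but is not writable. The touch error below is therefore more likely a visibility problem (the per-job directory not being there as seen from wherever touch ran) than a permissions one. A hedged sketch for comparing how a path looks from each node; `probe_dir` is a made-up name, and it relies only on `ls -n`, `id` and `uname`.

```bash
# Hypothetical helper: show how a directory looks from the node this runs on.
# Prints the node name, the caller's UID, and whether the directory exists,
# which numeric UID owns it, and whether the caller can write to it.
# Exit status: 0 = writable, 1 = exists but not writable, 2 = missing.
probe_dir() {
    local dir=$1 owner
    printf '%s uid=%s %s: ' "$(uname -n)" "$(id -u)" "$dir"
    if [ ! -d "$dir" ]; then
        echo "MISSING"
        return 2
    fi
    # ls -n prints numeric owner/group IDs; field 3 is the owner UID
    owner=$(ls -nd -- "$dir" | awk '{ print $3 }')
    if [ -w "$dir" ]; then
        echo "exists, owner uid=$owner, writable"
        return 0
    fi
    echo "exists, owner uid=$owner, NOT writable"
    return 1
}
```

Run it as labkey on both nodes against `~labkey/.globus` and, while the job is still around, against the job's own subdirectory (e.g. `~labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd`).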

Jobs submitted with qsub are fine:

   qsub -o ~labkey/globus_test/qsubtest_output.txt -e ~labkey/globus_test/qsubtest_err.txt qsubtest

   $ cat qsubtest
   #!/bin/bash
   date
   env
   logger "hello from qsubtest, I am $(whoami)"

and indeed it executes on the pbs_mom client node.
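The logger line in qsubtest is already a handy way to tell which node a job ran on, since the message is normally recorded by the syslog daemon of whichever host executed it. A small sketch for reading it back; it assumes the traditional syslog line layout (`Mon DD HH:MM:SS host tag: message`), the log file is usually /var/log/messages or /var/log/syslog depending on the distribution, and `find_marker` is a hypothetical name.

```bash
# Hypothetical helper: list the host field of every syslog line that
# contains a fixed marker string, e.g. the "hello from qsubtest" message.
# Assumes classic syslog lines: "Mon DD HH:MM:SS host tag: message",
# where the host is the 4th whitespace-separated field.
find_marker() {
    local logfile=$1 marker=$2
    grep -F -- "$marker" "$logfile" | awk '{ print $4 }' | sort -u
}
```

For example, `find_marker /var/log/messages "hello from qsubtest"` on each node; the same logger trick could be placed in whatever the PBS job runs.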

Jobs submitted with fork are fine:

   globusrun-ws -submit -f gramtest_fork

   $ cat gramtest_fork
   <job>
     <executable>/mnt/userdata/gramtest_fork.sh</executable>
     <stdout>globus_test/gramtest_fork_stdout</stdout>
     <stderr>globus_test/gramtest_fork_stderr</stderr>
   </job>

but those run local to the globus node, of course.

But a job submitted as

   globusrun-ws -submit -f gramtest_pbs -Ft PBS

   $ cat gramtest_pbs
   <job>
     <executable>/usr/bin/env</executable>
     <stdout>gramtest_pbs_stdout</stdout>
     <stderr>gramtest_pbs_stderr</stderr>
   </job>

gives this output:

   Host key verification failed.
   /bin/touch: cannot touch `/home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0': No such file or directory
   /var/spool/torque/mom_priv/jobs/1.domu-12-31-38-00-b4-b5.compute-1.internal.SC: 59: cannot open /home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0: No such file
   [: 59: !=: unexpected operator
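One way to get the node identified in the stderr file itself is to point `<executable>` at a small wrapper that prefixes its stderr with the host name and user. A hedged sketch; the wrapper and `run_tagged` are hypothetical, and it only tags output from the command it wraps, not from the job script Torque/GRAM builds around it. That is still informative: tagged lines tell you which node ran the executable and as whom, while untagged errors came from outside it. (The `/var/spool/torque/mom_priv/jobs/` path in the errors above looks like a job script spooled by pbs_mom, which hints that at least those lines came from the mom side.)

```bash
# Hypothetical wrapper function: run a command and prefix each line it writes
# to stderr with "[node user]", leaving its stdout untouched.
run_tagged() {
    local tag rc
    tag="[$(uname -n) $(id -un)]"
    # fd 3 = original stdout. Inside the pipeline the command's stderr is sent
    # into the pipe (to sed) and its stdout is pointed back at fd 3.
    { "$@" 2>&1 1>&3 3>&- | sed "s/^/$tag /" >&2; rc=${PIPESTATUS[0]}; } 3>&1
    # Return the wrapped command's exit status, not sed's
    return "$rc"
}
```

A script at a path the mom node can see could define this function and end with `run_tagged /usr/bin/env`, then be named in `<executable>` in place of /usr/bin/env.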

I'm stumped. What piece of the authentication picture am I missing? And how
can I identify which actor emitted that failure message?

Thanks,

Brian Pratt
