OK, I finally cracked the nut.  It was indeed an ssh issue, and the missing
piece was that the user had to be able to ssh to himself WITHIN THE SAME
NODE (!?!).  In my case the submitting user is "labkey" - it's understood
that lab...@[clientnode] needs to be able to ssh to lab...@[headnode] but it
turns out he also needs to be able to ssh to lab...@[clientnode].  This
seems odd to me, but that's how it is.  I suppose there might be a config
tweak for that somewhere.  Anyway, I just repeated, for lab...@clientnode
and itself, the same steps I'd used to establish ssh trust between
lab...@clientnode and lab...@headnode, and now it's all good.  One might
have guessed that this trust relationship was implicit, but it isn't - you
have to add labkey's RSA public key to ~labkey/.ssh/authorized_keys and
update ~labkey/.ssh/known_hosts to include our own hostname.
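
For anyone repeating this, the self-trust setup boils down to something
like the following - a rough sketch only, run as labkey on the client
node, assuming the usual OpenSSH layout and an existing id_rsa key pair
(adjust key names, paths and hostnames for your own setup):

  # append our own public key so the node accepts its own logins
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  # record our own host key so host key verification doesn't fail
  ssh-keyscan -t rsa $(hostname) >> ~/.ssh/known_hosts
  # sanity check: should succeed with no password or host key prompt
  ssh $(hostname) true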

strace -f on the client node was instrumental in figuring this out, as was
poking around in the Perl scripts on the server node.  ssldump was handy,
too.
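
For reference, the sort of invocations I mean - a rough sketch only; the
pid lookup, key path, interface and port below are examples rather than
exactly what I ran:

  # on the client node: follow pbs_mom and its children, logging exec and
  # connect calls so the failing ssh invocation shows up in the trace
  strace -f -e trace=execve,connect -o /tmp/mom.trace -p "$(pgrep -n pbs_mom)"
  # decrypt and display the GRAM SSL application data, given the server's
  # private key (the path here is just a placeholder)
  ssldump -A -d -k /etc/grid-security/hostkey.pem -i eth0 port 8443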

Thanks to Martin and Jim for the pointers.  If you're reading this in an
effort to solve a similar problem you might be interested to see my scripts
for configuring a simple globus+torque cluster on EC2 at
https://hedgehog.fhcrc.org/tor/stedi/trunk/AWS_EC2 .

Brian

On Fri, Dec 4, 2009 at 8:42 AM, Brian Pratt <brian.pr...@insilicos.com> wrote:

> Martin,
>
> Thanks for that tip and the link to some very useful notes.  I'd started
> poking around in that Perl module last night and it looks like maybe the
> problem actually has to do with ssh between agents within the same globus
> node, so my ssh trust relationships are not yet quite as comprehensive as
> they need to be.  I will certainly post the solution here when I crack the
> nut.  I've found lots of posts out there from folks with similar-sounding
> problems but no resolution; we'll try to fix that here.  Of course there
> are as many ways to go astray as there are clusters, but we must leave
> bread crumbs where we can...
> Brian
> On Thu, Dec 3, 2009 at 7:05 PM, Martin Feller <fel...@mcs.anl.gov> wrote:
>
>> Brian,
>>
>> The PBS job manager module is
>> $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm
>>
>> I remember that I had this or a similar problem once too, but can't
>> seem to find notes about it (sad, I know).
>> Here's some information about the Perl code which is called by the Java
>> pieces of ws-gram to submit the job to the local resource manager.
>>
>> http://www.mcs.anl.gov/~feller/Globus/technicalStuff/Gram/perl/
>>
>> While this does not help directly, it may help in debugging.
>> If I find my notes or have a good idea, I'll let you know.
>>
>> Martin
>>
>>
>>
>> Brian Pratt wrote:
>> > Good plan, thanks.  Now to figure out where that is...
>> >
>> > I'm certainly learning a lot!
>> >
>> >     > On Thu, Dec 3, 2009 at 2:01 PM, Jim Basney <jbas...@ncsa.uiuc.edu> wrote:
>> >
>> >     It's been a long time since I've debugged a problem like this, but
>> the
>> >     way I did it in the old days was to modify the Globus PBS glue
>> script to
>> >     dump what it's passing to qsub, so I could reproduce it manually.
>> >
>> >     Brian Pratt wrote:
>> >     > Let me amend that - I do think that this is sniffing around the
>> >     right tree,
>> >     > which is why I said this is in some ways more of a logging
>> >     question.  It
>> >     > does look very much like an ssh issue, so what I really need
>> >     is to
>> >     > figure out exactly what connection parameters were in use for the
>> >     > failure.
>> >     > They seem to be different in some respect than those used in the
>> qsub
>> >     > transactions.  What I could really use is a hint at how to lay
>> >     eyes on that.
>> >     >
>> >     > Thanks,
>> >     >
>> >     > Brian
>> >     >
>> >     > On Thu, Dec 3, 2009 at 1:38 PM, Brian Pratt
>> >     > <brian.pr...@insilicos.com> wrote:
>> >     >
>> >     >> Hi Jim,
>> >     >>
>> >     >> Thanks for the reply.  Unfortunately the answer doesn't seem to
>> >     be that
>> >     >> simple - I do have the ssh stuff worked out (believe me, I've
>> >     googled the
>> >     >> heck out of this thing!); the qsub test won't work without it.  I
>> >     can scp
>> >     >> between the two nodes in all combinations of user "globus" or
>> >     "labkey",
>> >     >> logged into either node, and in either direction.
>> >     >>
>> >     >> Thanks,
>> >     >>
>> >     >> Brian
>> >     >>
>> >     >>   On Thu, Dec 3, 2009 at 1:33 PM, Jim Basney
>> >     >> <jbas...@ncsa.uiuc.edu> wrote:
>>  >     >>
>> >     >>> Hi Brian,
>> >     >>>
>> >     >>> "Host key verification failed" is an ssh client-side error. The
>> >     top hit
>> >     >>> from Google for this error message is
>> >     >>> <http://www.securityfocus.com/infocus/1806> which looks like a
>> good
>> >     >>> reference on the topic. I suspect you need to populate and
>> >     distribute
>> >     >>> /etc/ssh_known_hosts files between your nodes.
>> >     >>>
>> >     >>> -Jim
>> >     >>>
>> >     >>> Brian Pratt wrote:
>> >     >>>> Actually more of a logging question - I don't expect anyone to
>> >     solve the
>> >     >>>> problem by remote control, but I'm having a bit of trouble
>> >     figuring out
>> >     >>>> which node (server or client) the error is coming from.
>> >     >>>>
>> >     >>>> Here's the scenario: a node running
>> >     globus/ws-gram/pbs_server/pbs_sched
>> >     >>> and
>> >     >>>> one running pbs_mom. Using the globus simple ca.
>> >      Job-submitting user is
>> >     >>>> "labkey" on the globus node, and there's a labkey user on the
>> >     client
>> >     >>> node
>> >     >>>> too.
>> >     >>>>
>> >     >>>>  I can watch decrypted SSL traffic on the client node with
>> >     ssldump and
>> >     >>>> simpleca private key and can see the job script being handed to
>> the
>> >     >>> pbs_mom
>> >     >>>> node.
>> >     >>>>
>> >     >>>> passwordless ssh/scp is configured between the two nodes.
>> >     >>>>
>> >     >>>> job-submitting user's .globus directory is shared via nfs with
>> >     the mom
>> >     >>>> node.  UIDs agree on both nodes.  globus user can write to it.
>> >     >>>>
>> >     >>>>  Jobs submitted with qsub are fine. "qsub -o
>> >     >>>> ~labkey/globus_test/qsubtest_output.txt -e
>> >     >>>> ~labkey/globus_test/qsubtest_err.txt qsubtest"
>> >     >>>>  cat qsubtest
>> >     >>>>    #!/bin/bash
>> >     >>>>    date
>> >     >>>>    env
>> >     >>>>    logger "hello from qsubtest, I am $(whoami)"
>> >     >>>> and indeed it executes on the pbs_mom client node.
>> >     >>>>
>> >     >>>> Jobs submitted with fork are fine.  "globusrun-ws -submit -f
>> >     >>> gramtest_fork"
>> >     >>>>  cat gramtest_fork
>> >     >>>> <job>
>> >     >>>>   <executable>/mnt/userdata/gramtest_fork.sh</executable>
>> >     >>>>   <stdout>globus_test/gramtest_fork_stdout</stdout>
>> >     >>>>   <stderr>globus_test/gramtest_fork_stderr</stderr>
>> >     >>>> </job>
>> >     >>>> but those run local to the globus node, of course.
>> >     >>>>
>> >     >>>> But a job submitted as
>> >     >>>> globusrun-ws -submit -f gramtest_pbs -Ft PBS
>> >     >>>>
>> >     >>>> cat gramtest_pbs
>> >     >>>> <job>
>> >     >>>>   <executable>/usr/bin/env</executable>
>> >     >>>>   <stdout>gramtest_pbs_stdout</stdout>
>> >     >>>>   <stderr>gramtest_pbs_stderr</stderr>
>> >     >>>> </job>
>> >     >>>>
>> >     >>>> Gives this:
>> >     >>>> Host key verification failed.
>> >     >>>> /bin/touch: cannot touch
>> >     >>>> `/home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0':
>> >     >>>> No such file or directory
>> >     >>>> /var/spool/torque/mom_priv/jobs/1.domu-12-31-38-00-b4-b5.compute-1.internal.SC:
>> >     >>>> 59: cannot open
>> >     >>>> /home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0:
>> >     >>>> No such file
>> >     >>>> [: 59: !=: unexpected operator
>> >     >>>>
>> >     >>>> I'm stumped - what piece of the authentication picture am I
>> >     missing?
>> >     >>>  And
>> >     >>>> how do I identify the actor that emitted that failure message?
>> >     >>>>
>> >     >>>> Thanks,
>> >     >>>>
>> >     >>>> Brian Pratt
>> >     >>
>> >     >
>> >
>> >
>>
>>
>
