Thanks for the feedback, Brian! I'll add a hint about this to my notes.

Martin

Brian Pratt wrote:
> OK, I finally cracked the nut.  It was indeed an ssh issue, and
> the missing piece was that the user had to be able to ssh to himself
> WITHIN THE SAME NODE (!?!).  In my case the submitting user is "labkey"
> - it's understood that lab...@[clientnode] needs to be able to ssh to
> lab...@[headnode], but it turns out he also needs to be able to ssh to
> lab...@[clientnode].  This seems odd to me, but that's how it is.  I
> suppose there might be a config tweak for that somewhere.
> Anyway, I just repeated the steps for establishing ssh trust between
> lab...@clientnode and lab...@headnode for lab...@clientnode and
> lab...@clientnode, and it's all good.  One might have guessed that this
> trust relationship was implicit, but it isn't - you have to add
> labkey's rsa public key to ~labkey/.ssh/authorized_keys, and update
> ~labkey/.ssh/known_hosts to include our own hostname.
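>
> In case it saves someone a step, the self-trust setup boils down to
> something like this, run as labkey on the client node (key type and
> exact paths may differ on your setup):
>
>     # let labkey@clientnode ssh to itself without a password
>     cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
>     chmod 600 ~/.ssh/authorized_keys
>     # record our own host key so ssh doesn't stop to ask about it
>     ssh-keyscan -t rsa $(hostname) >> ~/.ssh/known_hosts
>     # sanity check - should return silently, no prompts
>     ssh $(hostname) true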
>  
> strace -f on the client node was instrumental in figuring this out, as
> well as messing around in the perl scripts on the server node.  ssldump
> was handy, too.
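>
> For anyone reproducing the technique: something along these lines on
> the client node will log every exec made by pbs_mom and its children,
> which is one way to catch the failing ssh invocation and its arguments
> (the pid and output path here are illustrative):
>
>     strace -f -e trace=execve -s 256 -o /tmp/mom.strace -p 12345
>     grep ssh /tmp/mom.strace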
>  
> Thanks to Martin and Jim for the pointers.  If you're reading this in an
> effort to solve a similar problem you might be interested to see my
> scripts for configuring a simple globus+torque cluster
> on EC2 at https://hedgehog.fhcrc.org/tor/stedi/trunk/AWS_EC2 .
>  
> Brian
> 
> On Fri, Dec 4, 2009 at 8:42 AM, Brian Pratt <brian.pr...@insilicos.com> wrote:
> 
>     Martin,
>      
>     Thanks for that tip and the link to some very useful notes.  I'd
>     started poking around in that perl module last night and it looks
>     like maybe the problem is actually to do with ssh between agents
>     within the same globus node, so my ssh trust relationships are not
>     yet quite as comprehensive as they need to be.  I will certainly
>     post the solution here when I crack the nut.  I've found lots of
>     posts out there of folks with similar-sounding problems but no
>     resolution; we'll try to fix that here.  Of course there are as many
>     ways to go afoul as there are clusters, but we must leave bread
>     crumbs where we can...
>
>     Brian
>
>     On Thu, Dec 3, 2009 at 7:05 PM, Martin Feller <fel...@mcs.anl.gov> wrote:
> 
>         Brian,
> 
>         The PBS job manager module is
>         $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm
> 
>         I remember that I had this or a similar problem once too, but can't
>         seem to find notes about it (sad, I know).
>         Here's some information about the Perl code which is called by
>         the Java
>         pieces of ws-gram to submit the job to the local resource manager.
> 
>         http://www.mcs.anl.gov/~feller/Globus/technicalStuff/Gram/perl/
> 
>         While this does not help directly, it may help in debugging.
>         If I find my notes or have a good idea I'll let you know.
> 
>         Martin
> 
> 
> 
>         Brian Pratt wrote:
>         > Good plan, thanks.  Now to figure out where that is..
>         >
>         > I'm certainly learning a lot!
>         >
>         > On Thu, Dec 3, 2009 at 2:01 PM, Jim Basney
>         > <jbas...@ncsa.uiuc.edu> wrote:
>         >
>         >     It's been a long time since I've debugged a problem like
>         >     this, but the way I did it in the old days was to modify
>         >     the Globus PBS glue script to dump what it's passing to
>         >     qsub, so I could reproduce it manually.
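>         >
>         >     (A lower-touch variant, in case you'd rather not edit the
>         >     Perl, is a logging wrapper dropped in front of the real
>         >     qsub in $PATH - a rough sketch, with the real binary's
>         >     location adjusted to wherever it lives on your system:)
>         >
>         >         #!/bin/bash
>         >         # log the arguments, and the job script if it arrives
>         >         # on stdin, then hand everything to the real qsub
>         >         REAL_QSUB=/usr/bin/qsub
>         >         LOG=/tmp/qsub-debug.log
>         >         echo "[$(date)] qsub $*" >> "$LOG"
>         >         if [ -t 0 ]; then
>         >             exec "$REAL_QSUB" "$@"
>         >         else
>         >             tee -a "$LOG" | "$REAL_QSUB" "$@"
>         >         fi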
>         >
>         >     Brian Pratt wrote:
>         >     > Let me amend that - I do think that this is sniffing
>         >     > around the right tree, which is why I said this is in
>         >     > some ways more of a logging question.  It does look very
>         >     > much like an ssh issue, so what I really need is to
>         >     > figure out exactly what connection parameters were in
>         >     > use for the failure.  They seem to be different in some
>         >     > respect than those used in the qsub transactions.  What
>         >     > I could really use is a hint at how to lay eyes on that.
>         >     >
>         >     > Thanks,
>         >     >
>         >     > Brian
>         >     >
>         >     > On Thu, Dec 3, 2009 at 1:38 PM, Brian Pratt
>         >     > <brian.pr...@insilicos.com> wrote:
>         >     >
>         >     >> Hi Jim,
>         >     >>
>         >     >> Thanks for the reply.  Unfortunately the answer doesn't
>         >     >> seem to be that simple - I do have the ssh stuff worked
>         >     >> out (believe me, I've googled the heck out of this
>         >     >> thing!); the qsub test won't work without it.  I can scp
>         >     >> between the two nodes in all combinations of user
>         >     >> "globus" or "labkey", logged into either node, and in
>         >     >> either direction.
>         >     >>
>         >     >> Thanks,
>         >     >>
>         >     >> Brian
>         >     >>
>         >     >> On Thu, Dec 3, 2009 at 1:33 PM, Jim Basney
>         >     >> <jbas...@ncsa.uiuc.edu> wrote:
>         >     >>
>         >     >>> Hi Brian,
>         >     >>>
>         >     >>> "Host key verification failed" is an ssh client-side
>         error. The
>         >     top hit
>         >     >>> from Google for this error message is
>         >     >>> <http://www.securityfocus.com/infocus/1806> which
>         looks like a good
>         >     >>> reference on the topic. I suspect you need to populate and
>         >     distribute
>         >     >>> /etc/ssh_known_hosts files between your nodes.
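>         >     >>>
>         >     >>> For example, something like this, run once and copied
>         >     >>> to every node (hostnames are placeholders; on some
>         >     >>> systems the file lives at /etc/ssh/ssh_known_hosts):
>         >     >>>
>         >     >>>     # gather the nodes' host keys into one file
>         >     >>>     ssh-keyscan -t rsa headnode clientnode > /tmp/ssh_known_hosts
>         >     >>>     # install it on each node (repeat per node)
>         >     >>>     scp /tmp/ssh_known_hosts root@clientnode:/etc/ssh_known_hosts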
>         >     >>>
>         >     >>> -Jim
>         >     >>>
>         >     >>> Brian Pratt wrote:
>         >     >>>> Actually more of a logging question - I don't expect
>         >     >>>> anyone to solve the problem by remote control, but I'm
>         >     >>>> having a bit of trouble figuring out which node (server
>         >     >>>> or client) the error is coming from.
>         >     >>>>
>         >     >>>> Here's the scenario: a node running
>         >     >>>> globus/ws-gram/pbs_server/pbs_sched and one running
>         >     >>>> pbs_mom.  Using the globus simple ca.  Job-submitting
>         >     >>>> user is "labkey" on the globus node, and there's a
>         >     >>>> labkey user on the client node too.
>         >     >>>>
>         >     >>>> I can watch decrypted SSL traffic on the client node
>         >     >>>> with ssldump and the simpleca private key, and can see
>         >     >>>> the job script being handed to the pbs_mom node.
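>         >     >>>>
>         >     >>>> (For reference, an ssldump invocation along these
>         >     >>>> lines does it - the interface, host filter, and key
>         >     >>>> path are placeholders for whatever applies locally:)
>         >     >>>>
>         >     >>>>     # print decoded application data for traffic to/from the head node
>         >     >>>>     ssldump -A -d -k /path/to/key.pem -i eth0 host headnode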
>         >     >>>>
>         >     >>>> passwordless ssh/scp is configured between the two nodes.
>         >     >>>>
>         >     >>>> job-submitting user's .globus directory is shared via
>         >     >>>> nfs with the mom node.  UIDs agree on both nodes.
>         >     >>>> globus user can write to it.
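>         >     >>>>
>         >     >>>> (Quick sanity checks - the second assumes sudo is
>         >     >>>> available:)
>         >     >>>>
>         >     >>>>     id labkey    # run on both nodes; uid/gid should match
>         >     >>>>     sudo -u globus touch ~labkey/.globus/writetest    # globus user can write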
>         >     >>>>
>         >     >>>>  Jobs submitted with qsub are fine. "qsub -o
>         >     >>>> ~labkey/globus_test/qsubtest_output.txt -e
>         >     >>>> ~labkey/globus_test/qsubtest_err.txt qsubtest"
>         >     >>>>  cat qsubtest
>         >     >>>>    #!/bin/bash
>         >     >>>>    date
>         >     >>>>    env
>         >     >>>>    logger "hello from qsubtest, I am $(whoami)"
>         >     >>>> and indeed it executes on the pbs_mom client node.
>         >     >>>>
>         >     >>>> Jobs submitted with fork are fine.
>         >     >>>> "globusrun-ws -submit -f gramtest_fork"
>         >     >>>>  cat gramtest_fork
>         >     >>>> <job>
>         >     >>>>   <executable>/mnt/userdata/gramtest_fork.sh</executable>
>         >     >>>>   <stdout>globus_test/gramtest_fork_stdout</stdout>
>         >     >>>>   <stderr>globus_test/gramtest_fork_stderr</stderr>
>         >     >>>> </job>
>         >     >>>> but those run local to the globus node, of course.
>         >     >>>>
>         >     >>>> But a job submitted as
>         >     >>>> globusrun-ws -submit -f gramtest_pbs -Ft PBS
>         >     >>>>
>         >     >>>> cat gramtest_pbs
>         >     >>>> <job>
>         >     >>>>   <executable>/usr/bin/env</executable>
>         >     >>>>   <stdout>gramtest_pbs_stdout</stdout>
>         >     >>>>   <stderr>gramtest_pbs_stderr</stderr>
>         >     >>>> </job>
>         >     >>>>
>         >     >>>> Gives this:
>         >     >>>> Host key verification failed.
>         >     >>>> /bin/touch: cannot touch
>         >     >>>> `/home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0':
>         >     >>>> No such file or directory
>         >     >>>> /var/spool/torque/mom_priv/jobs/1.domu-12-31-38-00-b4-b5.compute-1.internal.SC:
>         >     >>>> 59: cannot open
>         >     >>>> /home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0:
>         >     >>>> No such file
>         >     >>>> [: 59: !=: unexpected operator
>         >     >>>>
>         >     >>>> I'm stumped - what piece of the authentication picture
>         >     >>>> am I missing?  And how to identify the actor that
>         >     >>>> emitted that failure message?
>         >     >>>>
>         >     >>>> Thanks,
>         >     >>>>
>         >     >>>> Brian Pratt
>         >     >>
>         >     >
>         >
>         >
> 
> 
> 
