[ https://issues.apache.org/jira/browse/WHIRR-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108253#comment-13108253 ]

Paul Baclace commented on WHIRR-378:
------------------------------------

I see this issue too (in 0.6.0), as far as I can tell from the description. The 
upshot is that some nodes are deleted as dead on arrival and replacement nodes 
are allocated, so the cluster is still created successfully. BUT I am charged 
for 1 hour of time on each apparently DOA node.

In one run I found that 2 out of 5 nodes were seemingly dead on arrival (I have 
many examples from the same day). That is a high failure rate, so I wonder 
whether these were false-positive DOAs. A summary of the trimmed whirr.log is 
below (nodes identified by the last 3 hex digits of the instance id):

1. starting 3 instances/nodes (fbe, fc0, fc2) at 3:37:19
2. problem with a node (fc2) at 3:38:46, or 87 sec. after node start
3. starting a new instance/node (01c) at 3:40:14
4. problem with another node (01c) at 3:41:19, or 65 sec. after node start
5. starting a new instance/node (040) at 3:41:22
6. deleting nodes (01c, fc2) at 3:44:34
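The elapsed times above can be checked directly from the log timestamps; a 
quick sketch (plain Python, timestamps copied from the excerpt further down):

```python
from datetime import datetime

def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS log timestamps on the same day."""
    fmt = "%H:%M:%S"
    return int((datetime.strptime(end, fmt)
                - datetime.strptime(start, fmt)).total_seconds())

# fc2 started at 03:37:19 and failed at 03:38:46
print(elapsed_seconds("03:37:19", "03:38:46"))  # 87
# 01c started at 03:40:14 and failed at 03:41:19
print(elapsed_seconds("03:40:14", "03:41:19"))  # 65
```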

The most common caused-by ssh error is "net.schmizz.sshj.userauth.UserAuthException: 
publickey auth failed".

It looks like the overall error "problem applying options to node" occurs 
about 10 seconds after opening the socket, so the node is alive to some extent 
and this does not appear to be an ssh timeout.  That it happens about 1 minute 
after instance start makes me think there could be an implicit timer awaiting 
boot-up.  (These instances all use the same private AMI from instance-store, 
with no EBS volumes.)
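I don't know where such a timer would live in jclouds; just to make the 
hypothesis concrete, here is a sketch of the kind of bounded wait I mean 
(names and deadline value are mine, not jclouds'). A node whose sshd isn't up 
within the deadline would be declared dead on arrival even if it comes up 
moments later:

```python
import socket
import time

def wait_for_ssh(host: str, port: int = 22,
                 deadline_s: float = 60.0, poll_s: float = 5.0) -> bool:
    """Poll a TCP port until it accepts connections or the deadline passes.

    Hypothetical illustration of an implicit boot-up timer: with a
    deadline of ~60 s, a slow-booting instance fails the check and gets
    treated as DOA even though it may be reachable shortly afterward.
    """
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        try:
            with socket.create_connection((host, port), timeout=poll_s):
                return True  # port is accepting connections
        except OSError:
            time.sleep(poll_s)  # not up yet; retry until the deadline
    return False
```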

The failed nodes appear to be deleted only after enough replacement nodes have 
started up, not at the moment they are determined to have failed.  Looking at 
billing records, I noticed that I *am* being charged for these failed nodes, 
so I think this is an important bug to fix.
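If Whirr terminated a node as soon as it was marked failed, instead of 
batching the deletes until after the replacements are up (3:44:34 here, about 
6 minutes after the fc2 failure), the billed time on DOA nodes would at least 
be minimized. A rough sketch of the ordering I mean, with a stand-in `client` 
object (not the real jclouds/Whirr API):

```python
def bootstrap(client, requested: int):
    """Start nodes until `requested` are healthy; kill failures immediately.

    `client` is a hypothetical stand-in with start_node() -> id,
    apply_options(id) (raises RuntimeError on a DOA node), and
    terminate(id). Not the jclouds ComputeService API.
    """
    live = []
    while len(live) < requested:
        node_id = client.start_node()
        try:
            client.apply_options(node_id)
        except RuntimeError:
            # Terminate right away: on EC2 the instance is billed from
            # launch, so every minute it stays up before deletion costs money.
            client.terminate(node_id)
        else:
            live.append(node_id)
    return live
```

The key difference from the observed behavior is that `terminate` runs before 
the replacement node is even requested, rather than after the whole cluster is 
assembled.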


-----whirr.log excerpt-------
03:37:19,043 DEBUG [jclouds.compute]  << started instances([region=us-west-1, 
name=i-f9914fbe])
03:37:19,133 DEBUG [jclouds.compute]  << present instances([region=us-west-1, 
name=i-f9914fbe])
03:37:19,332 DEBUG [jclouds.compute]  << started instances([region=us-west-1, 
name=i-87914fc0],[region=us-west-1, name=i-85914fc2])
03:37:19,495 DEBUG [jclouds.compute]  << present instances([region=us-west-1, 
name=i-87914fc0],[region=us-west-1, name=i-85914fc2])

03:38:46,153 ERROR [jclouds.compute]  << problem applying options to 
node(us-west-1/i-85914fc2)

03:40:14,460 DEBUG [jclouds.compute]  << started instances([region=us-west-1, 
name=i-5b8e501c])
03:40:14,547 DEBUG [jclouds.compute]  << present instances([region=us-west-1, 
name=i-5b8e501c])

03:41:19,691 ERROR [jclouds.compute]  << problem applying options to 
node(us-west-1/i-5b8e501c)

03:41:22,738 DEBUG [jclouds.compute]  << started instances([region=us-west-1, 
name=i-078e5040])
03:41:22,831 DEBUG [jclouds.compute]  << present instances([region=us-west-1, 
name=i-078e5040])
03:44:34,257 INFO  [org.apache.whirr.actions.BootstrapClusterAction]  Deleting 
failed node node us-west-1/i-5b8e501c
03:44:34,259 INFO  [org.apache.whirr.actions.BootstrapClusterAction]  Deleting 
failed node node us-west-1/i-85914fc2
03:46:27,948 INFO  [org.apache.whirr.service.FileClusterStateStore] (main) 
Wrote instances file instances

The instances file ends up containing:   i-f9914fbe i-87914fc0 i-078e5040
and does not contain: i-5b8e501c  i-85914fc2



> Auth fail when creating a cluster from an EC2 instance
> ------------------------------------------------------
>
>                 Key: WHIRR-378
>                 URL: https://issues.apache.org/jira/browse/WHIRR-378
>             Project: Whirr
>          Issue Type: Bug
>          Components: service/hadoop
>    Affects Versions: 0.6.0
>            Reporter: Marc de Palol
>
> There is an ssh auth problem when creating a hadoop cluster from an EC2 
> ubuntu instance. 
> I've been using the same configuration file from an EC2 computer and a 
> physical one; everything works fine on the physical one, but I keep getting 
> this error on EC2: 
> Running configuration script on nodes: [us-east-1/i-c7fde5a6, 
> us-east-1/i-c9fde5a8, us-east-1/i-cbfde5aa]
> <<authenticated>> woke to: net.schmizz.sshj.userauth.UserAuthException: 
> publickey auth failed
> <<authenticated>> woke to: net.schmizz.sshj.userauth.UserAuthException: 
> publickey auth failed
> The user in the virtual machine is new and has valid .ssh keys.
> The hadoop config file is (omitting commented lines): 
> whirr.cluster-name=hadoop
> whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 
> hadoop-datanode+hadoop-tasktracker
> whirr.provider=aws-ec2
> whirr.identity=****
> whirr.credential=****
> whirr.hardware-id=c1.xlarge
> whirr.image-id=us-east-1/ami-da0cf8b3
> whirr.location-id=us-east-1

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira