The client might be hanging when trying to connect to the instances over
SSH. I'm not sure whether jclouds has (or supports) timeouts for this
operation. If you see this situation again, a thread dump would be very
useful for diagnosing it further.
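
On the launch machine, something along these lines should capture one
(just a sketch, assuming a standard JDK with jps/jstack on the PATH):

  jps -l                           # find the PID of the Whirr JVM
  jstack <pid> > whirr-threads.txt

or, failing that, kill -QUIT <pid> makes the JVM print the stack traces
to its own console output.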

Thanks,
Tom

On Thu, Mar 10, 2011 at 12:25 AM, Selwyn McCracken
<selwyn.mccrac...@gmail.com> wrote:
> Thanks Tom.
>
> Will build from the trunk tonight and give it a test (it does appear
> to be the same issue as WHIRR-167).
>
> The script hangs on the launch machine. I launched some smaller
> clusters, so hopefully the section of the log below is the relevant
> one; it is what was displayed when I had to use Ctrl-Z to get control
> of the terminal back so I could destroy the cluster.
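>
> For reference, the teardown afterwards was just the Whirr CLI, roughly
> as below (the properties file name here is simply what I happen to
> call my recipe locally):
>
>   bin/whirr destroy-cluster --config hadoop8l.properties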
>
> --
> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
> dpkg-preconfigure: unable to re-open stdin:
>
> 2011-03-07 20:23:31,518 DEBUG [jclouds.compute] (user thread 11) <<
> options applied node(us-east-1/i-851d14e9)
> 2011-03-07 20:23:31,524 INFO
> [org.apache.whirr.cluster.actions.NodeStarter] (pool-1-thread-2) Nodes
> started: [[id=us-east-1/i-8b1d14e7, providerId=i-8b1d14e7,
> group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
> description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
> metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
> family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
> description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
> state=RUNNING, loginPort=22, privateAddresses=[10.114.121.62],
> publicAddresses=[184.73.9.122], hardware=[id=m1.large,
> providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
> ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
> durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
> device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
> type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
> isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
> userMetadata={}], [id=us-east-1/i-891d14e5, providerId=i-891d14e5,
> group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
> description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
> metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
> family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
> description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
> state=RUNNING, loginPort=22, privateAddresses=[10.114.206.253],
> publicAddresses=[72.44.38.144], hardware=[id=m1.large,
> providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
> ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
> durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
> device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
> type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
> isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
> userMetadata={}], [id=us-east-1/i-871d14eb, providerId=i-871d14eb,
> group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
> description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
> metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
> family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
> description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
> state=RUNNING, loginPort=22, privateAddresses=[10.114.74.91],
> publicAddresses=[50.16.96.184], hardware=[id=m1.large,
> providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
> ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
> durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
> device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
> type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
> isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
> userMetadata={}], [id=us-east-1/i-8d1d14e1, providerId=i-8d1d14e1,
> group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
> description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
> metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
> family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
> description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
> state=RUNNING, loginPort=22, privateAddresses=[10.212.167.31],
> publicAddresses=[174.129.88.235], hardware=[id=m1.large,
> providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
> ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
> durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
> device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
> type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
> isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
> userMetadata={}], [id=us-east-1/i-b11d14dd, providerId=i-b11d14dd,
> group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
> description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
> metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
> family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
> description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
> state=RUNNING, loginPort=22, privateAddresses=[10.116.149.144],
> publicAddresses=[174.129.74.156], hardware=[id=m1.large,
> providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
> ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
> durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
> device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
> type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
> isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
> userMetadata={}], [id=us-east-1/i-8f1d14e3, providerId=i-8f1d14e3,
> group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
> description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
> metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
> family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
> description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
> state=RUNNING, loginPort=22, privateAddresses=[10.114.251.250],
> publicAddresses=[67.202.41.42], hardware=[id=m1.large,
> providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
> ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
> durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
> device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
> type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
> isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
> userMetadata={}], [id=us-east-1/i-b31d14df, providerId=i-b31d14df,
> group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
> description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
> metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
> family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
> description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
> state=RUNNING, loginPort=22, privateAddresses=[10.116.222.97],
> publicAddresses=[75.101.229.142], hardware=[id=m1.large,
> providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
> ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
> durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
> device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
> type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
> isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
> userMetadata={}], [id=us-east-1/i-851d14e9, providerId=i-851d14e9,
> group=hadoop8l, name=null, location=[id=us-east-1b, scope=ZONE,
> description=us-east-1b, parent=us-east-1, iso3166Codes=[US-VA],
> metadata={}], uri=null, imageId=us-east-1/ami-da0cf8b3, os=[name=null,
> family=ubuntu, version=10.04, arch=paravirtual, is64Bit=true,
> description=ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20101020.manifest.xml],
> state=RUNNING, loginPort=22, privateAddresses=[10.116.222.165],
> publicAddresses=[50.16.23.148], hardware=[id=m1.large,
> providerId=m1.large, name=null, processors=[[cores=2.0, speed=2.0]],
> ram=7680, volumes=[[id=null, type=LOCAL, size=10.0, device=/dev/sda1,
> durable=false, isBootDevice=true], [id=null, type=LOCAL, size=420.0,
> device=/dev/sdb, durable=false, isBootDevice=false], [id=null,
> type=LOCAL, size=420.0, device=/dev/sdc, durable=false,
> isBootDevice=false]], supportsImage=is64Bit()], loginUser=ubuntu,
> userMetadata={}]]
>
> On Tue, Mar 8, 2011 at 11:39 PM, Tom White <tom.e.wh...@gmail.com> wrote:
>> Hi Selwyn,
>>
>> https://issues.apache.org/jira/browse/WHIRR-167 should improve the
>> reliability of larger clusters, but it isn't in a released version yet
>> (it will be in 0.4.0). You might try building trunk to see if it helps you.
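>>
>> Building trunk is just a standard Maven build from a checkout of the
>> source (see the project site for the current repository location);
>> roughly:
>>
>>   mvn clean install -DskipTests
>>
>> (drop -DskipTests if you want to run the test suite as well).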
>>
>> Where does the script hang? On the cloud instance or on the launch
>> machine? What's the last thing in the log?
>>
>> Adding nodes to a running cluster is still under development
>> (https://issues.apache.org/jira/browse/WHIRR-214).
>>
>> Cheers,
>> Tom
>>
>> On Mon, Mar 7, 2011 at 1:51 PM, Selwyn McCracken
>> <selwyn.mccrac...@gmail.com> wrote:
>>> Hi Whirrers,
>>>
>>> I have been successfully launching smaller clusters with whirr (<= 4
>>> data nodes).
>>>
>>> When I try to scale to something larger (8+ nodes), some of the nodes
>>> terminate during the startup process, and frequently the namenode is
>>> one of them.
>>>
>>> I have reviewed the logs and there doesn't appear to be anything I can
>>> spot (in fact the whirr script hangs and never exits, so the log never
>>> completes).
>>>
>>> I suspect something is timing out, perhaps because the cluster is being
>>> launched serially...
>>>
>>> Has there been any progress on adding nodes to an already running
>>> cluster? That might help work around this problem, and it would also
>>> make my benchmarking tests easier: I am trying to show a linear
>>> decrease in processing time as the number of nodes increases, so I
>>> won't have to start a fresh cluster and reload the data into HDFS for
>>> each test run.
>>>
>>> Anyway, here is the recipe I have been using:
>>>
>>> whirr.cluster-name=hadoop8l
>>> whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,8 hadoop-datanode+hadoop-tasktracker
>>> whirr.hadoop-install-function=install_cdh_hadoop
>>> whirr.hadoop-configure-function=configure_cdh_hadoop
>>> whirr.provider=aws-ec2
>>> whirr.identity=${env:AWS_ACCESS_KEY_ID}
>>> whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
>>> whirr.hardware-id=m1.large
>>> whirr.image-id=us-east-1/ami-da0cf8b3
>>> whirr.location-id=us-east-1
>>>
>>> Any help greatly appreciated.
>>> Selwyn
>>>
>>
>
