Hadoop with EC2

2009-01-29 Thread FRANK MASHRAQI
Hello,

I am trying to set up a Hadoop cluster with EBS. I cannot find where/how to
specify which EBS volumes to use for the master/slaves in the scripts in the
hadoop-0.19.0/src/contrib/ec2 directory. I am able to set up the Hadoop
cluster manually, but would like to automate launch and termination of the
cluster. It seems the scripts assume that one will use S3 for Hadoop. Can
anyone guide me on how to pass EBS volume information to an instance when
it is launched?

Thanks,
Frank 


Re: Hadoop with EC2

2009-01-29 Thread Darren Govoni
You need to mount the EBS volume as a local filesystem and configure
hadoop to use the filesystem mount point as its data path.
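
A minimal sketch of that approach with the classic EC2 API tools, using made-up
volume/instance IDs and a made-up device/mount point:

# attach the volume (run wherever the EC2 API tools and your credentials live)
ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdh
# on the instance: create a filesystem (first use only - this wipes the volume) and mount it
mkfs.ext3 /dev/sdh
mkdir -p /mnt/ebs && mount /dev/sdh /mnt/ebs
# then point Hadoop at the mount point in hadoop-site.xml, e.g. set
# dfs.data.dir (and dfs.name.dir on the master) to /mnt/ebs/hadoop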

On Thu, 2009-01-29 at 19:49 -0500, FRANK MASHRAQI wrote:
 Hello,
 
 I am trying to set up a Hadoop cluster with EBS. I cannot find where/how to
 specify which EBS volumes to use for the master/slaves in the scripts in the
 hadoop-0.19.0/src/contrib/ec2 directory. I am able to set up the Hadoop
 cluster manually, but would like to automate launch and termination of the
 cluster. It seems the scripts assume that one will use S3 for Hadoop. Can
 anyone guide me on how to pass EBS volume information to an instance when
 it is launched?
 
 Thanks,
 Frank 



hadoop 0.18.0 ec2 images?

2008-08-20 Thread Karl Anderson
Are there any publicly available EC2 images for Hadoop 0.18.0 yet?   
There don't seem to be any in the hadoop-ec2-images bucket.


Re: Hadoop on EC2 + S3 - best practice?

2008-07-01 Thread Tom White
Hi Tim,

The steps you outline look about right. Because your file is larger than
5GB you will need to use the S3 block file system, which uses an s3 URL.
(See http://wiki.apache.org/hadoop/AmazonS3.) You shouldn't have to build
your own AMI unless you have dependencies that can't be submitted as
part of the MapReduce job.

To read from and write to S3 you can just use s3 URLs. Otherwise you can
use distcp to copy between S3 and HDFS before and after running your
job. This article I wrote has some more tips:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873
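
A rough sketch of the distcp route, with made-up bucket and path names and a
hypothetical job class (S3 credentials go in hadoop-site.xml as
fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey, or inline in the URL as
described on the wiki page above):

# copy the input from the S3 block filesystem into HDFS
bin/hadoop distcp s3://my-bucket/input /user/root/input
# run the MapReduce job against HDFS as usual
bin/hadoop jar my-index-job.jar com.example.IndexJob /user/root/input /user/root/output
# push the results back out to S3 afterwards
bin/hadoop distcp /user/root/output s3://my-bucket/output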

Hope that helps,

Tom

On Sat, Jun 28, 2008 at 10:24 AM, tim robertson
[EMAIL PROTECTED] wrote:
 Hi all,
 I have data in a file (150 million lines, roughly 100GB) and have several
 MapReduce classes for my processing (custom index generation).

 Can someone please confirm that the following is the best way to run on EC2
 and S3 (both of which I am new to)?

 1) load my 100GB file into S3
 2) create a class that will load the file from S3 and use it as input to
 MapReduce (S3 not used during processing), and save the output back to S3
 3) create an AMI with Hadoop + dependencies and my jar file (loading the
 S3 input and the MR code) - I will base this on the public Hadoop AMI I
 guess
 4) run using the standard scripts

 Is this best practice?
 I assume this is pretty common... is there a better way where I can submit
 my jar at runtime and just pass in the URLs for the input and output files
 in S3?

 If not, does anyone have an example that takes input from S3 and writes
 output to S3 also?

 Thanks for advice, or suggestions on the best way to run.

 Tim



Hadoop on EC2 + S3 - best practice?

2008-06-28 Thread tim robertson
Hi all,
I have data in a file (150 million lines, roughly 100GB) and have several
MapReduce classes for my processing (custom index generation).

Can someone please confirm that the following is the best way to run on EC2
and S3 (both of which I am new to)?

1) load my 100GB file into S3
2) create a class that will load the file from S3 and use it as input to
MapReduce (S3 not used during processing), and save the output back to S3
3) create an AMI with Hadoop + dependencies and my jar file (loading the
S3 input and the MR code) - I will base this on the public Hadoop AMI I
guess
4) run using the standard scripts

Is this best practice?
I assume this is pretty common... is there a better way where I can submit
my jar at runtime and just pass in the URLs for the input and output files
in S3?

If not, does anyone have an example that takes input from S3 and writes
output to S3 also?

Thanks for advice, or suggestions on the best way to run.

Tim


Re: hadoop on EC2

2008-06-04 Thread Steve Loughran

Andreas Kostyrka wrote:
Well, the basic trouble with EC2 is that clusters usually are not networks
in the TCP/IP sense.

This makes it painful to decide which URLs should be resolved where.

Plus, to make it even more painful, you cannot easily run it through one
simple SOCKS server, because you need to defer DNS resolution to the inside
of the cluster: the VM names resolve to external IPs, while the webservers
we are all interested in reside on the internal 10/8 IPs.

Another fun item is that in many situations you will have multiple islands
inside EC2 (a contractor working for multiple customers that have EC2
deployments comes to mind), so you cannot just route everything over one
pipe into EC2.

My current setup relies on a very long list of ssh -L tunnel forwards plus
iptables rules in the nat OUTPUT chain that make external-ip-of-vm1:50030 get
redirected to localhost:SOMEPORT, which is forwarded to name-of-vm1:50030 via
ssh. (Implementation left as an exercise for the reader, or my ugly
non-error-checking script is available on request :-P)

If one wanted a more generic solution to redirect TCP ports via an ssh SOCKS
tunnel (aka dynamic port forwarding), the following components would be
needed:

-) a list of rules for what gets forwarded where and how.
-) a DNS resolver that issues fake IP addresses to capture the name of the
connected host.
-) a small forwarding script that checks the real destination IP to decide
which IP address/port is being requested. (Hint: current Linux kernels don't
expose this via getsockname anymore; the real destination is nowadays carried
as a socket option.)

One of the uglier parts, for which I have found no real solution, is the fact
that one cannot be sure that ssh will be able to listen on a given port.

Solutions I've found include:
-) checking the port before issuing ssh (race-condition warning: going through
this hole, the whole Federation star fleet could get lost);
-) using some kind of expect to drive ssh through a pty;
-) rolling your own ssh tunnel solution. The only library that comes to mind
is Twisted, in which case one could skip the SOCKS protocol altogether.

But luckily for us the solution is easier, because in the hadoop case we only
need to tunnel HTTP, which has the big benefit that we do not need to capture
the hostname: HTTP carries the hostname inside the payload (the Host header).


Do you worry about, or address, the risk of someone like me bringing up a
machine in the EC2 farm that then port-scans all the near-neighbours in the
address space for open HDFS datanode/namenode ports, and strikes up a
conversation with your filesystem?




--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: hadoop on EC2

2008-06-03 Thread Andreas Kostyrka
Well, the basic trouble with EC2 is that clusters usually are not networks
in the TCP/IP sense.

This makes it painful to decide which URLs should be resolved where.

Plus, to make it even more painful, you cannot easily run it through one
simple SOCKS server, because you need to defer DNS resolution to the inside
of the cluster: the VM names resolve to external IPs, while the webservers
we are all interested in reside on the internal 10/8 IPs.

Another fun item is that in many situations you will have multiple islands
inside EC2 (a contractor working for multiple customers that have EC2
deployments comes to mind), so you cannot just route everything over one
pipe into EC2.

My current setup relies on a very long list of ssh -L tunnel forwards plus
iptables rules in the nat OUTPUT chain that make external-ip-of-vm1:50030 get
redirected to localhost:SOMEPORT, which is forwarded to name-of-vm1:50030 via
ssh. (Implementation left as an exercise for the reader, or my ugly
non-error-checking script is available on request :-P)

If one wanted a more generic solution to redirect TCP ports via an ssh SOCKS
tunnel (aka dynamic port forwarding), the following components would be
needed:

-) a list of rules for what gets forwarded where and how.
-) a DNS resolver that issues fake IP addresses to capture the name of the
connected host.
-) a small forwarding script that checks the real destination IP to decide
which IP address/port is being requested. (Hint: current Linux kernels don't
expose this via getsockname anymore; the real destination is nowadays carried
as a socket option.)

One of the uglier parts, for which I have found no real solution, is the fact
that one cannot be sure that ssh will be able to listen on a given port.

Solutions I've found include:
-) checking the port before issuing ssh (race-condition warning: going through
this hole, the whole Federation star fleet could get lost);
-) using some kind of expect to drive ssh through a pty;
-) rolling your own ssh tunnel solution. The only library that comes to mind
is Twisted, in which case one could skip the SOCKS protocol altogether.

But luckily for us the solution is easier, because in the hadoop case we only
need to tunnel HTTP, which has the big benefit that we do not need to capture
the hostname: HTTP carries the hostname inside the payload (the Host header).

Not tested, but the following should work:
1.) Set up a proxy on the cluster somewhere. Make it do auth (proxy auth
might work too, but depending upon how one makes the browser access the proxy
this might be a bad idea).
2.) Make the client access the proxy for the needed host/port combinations.
FoxyProxy or similar extensions for Firefox come to mind (a minimal sketch
follows below), or some destination NAT rules on your packet firewall should
do the trick.
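
A minimal sketch of the SOCKS variant, with made-up hostnames and ports (any
node you can ssh into, e.g. the master's public DNS name, will do as the
tunnel endpoint):

# dynamic (SOCKS) forward: one tunnel covers the whole cluster
ssh -N -D 6666 root@ec2-12-34-56-78.compute-1.amazonaws.com
# point the browser (e.g. via FoxyProxy, with remote DNS enabled) at SOCKS
# proxy localhost:6666 for the *.compute-1.internal names / 10.x.x.x addresses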

Andreas


On Monday 02 June 2008 20:27:53 Chris K Wensel wrote:
 obviously this isn't the best solution if you need to let many
 semi-trusted users browse your cluster.

 Actually, it would be much more secure if the tunnel service ran on a
 trusted server letting your users connect remotely via SOCKS and then
 browse the cluster. These users wouldn't need any AWS keys etc.


 Chris K Wensel
 [EMAIL PROTECTED]
 http://chris.wensel.net/
 http://www.cascading.org/






Re: hadoop on EC2

2008-06-02 Thread Chris K Wensel

if you use the new scripts in 0.17.0, just run

 hadoop-ec2 proxy cluster-name

this starts an ssh tunnel to your cluster.

installing FoxyProxy in FF gives you whole-cluster visibility..

obviously this isn't the best solution if you need to let many
semi-trusted users browse your cluster.


On May 28, 2008, at 1:22 PM, Andreas Kostyrka wrote:


Hi!

I just wondered what other people use to access the hadoop webservers,
when running on EC2?

Ideas that I had:
1.) opening ports 50030 and so on = not good, data goes unprotected
over the internet. Even if I could enable some form of authentication, it
would still be plain HTTP.

2.) Some kind of tunneling solution. The problem on this side is that
each of my cluster nodes is in a different subnet, plus there is the
dualism between the internal and external addresses of the nodes.

Any hints? TIA,

Andreas


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: hadoop on EC2

2008-06-02 Thread Chris K Wensel


obviously this isn't the best solution if you need to let many
semi-trusted users browse your cluster.



Actually, it would be much more secure if the tunnel service ran on a  
trusted server letting your users connect remotely via SOCKS and then  
browse the cluster. These users wouldn't need any AWS keys etc.



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: hadoop on EC2

2008-05-30 Thread Andreas Kostyrka
On Wednesday 28 May 2008 23:16:43 Chris Anderson wrote:
 Andreas,

 If you can ssh into the nodes, you can always set up port-forwarding
 with ssh -L to bring those ports to your local machine.

Yes, and the missing part is simple too: iptables with DNAT on OUTPUT :)

I even made a small ugly script for this kind of tunneling.
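
A minimal sketch of that trick for a single port, with made-up names and
addresses (203.0.113.21 stands in for vm1's public IP, name-of-vm1 for its
internal name):

# local forward through the master: localhost:15030 -> name-of-vm1:50030
ssh -N -L 15030:name-of-vm1:50030 root@master-public-dns &
# rewrite locally generated traffic aimed at vm1's public IP so it lands on the tunnel
iptables -t nat -A OUTPUT -p tcp -d 203.0.113.21 --dport 50030 \
  -j DNAT --to-destination 127.0.0.1:15030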

Andreas


 On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka [EMAIL PROTECTED] 
wrote:
  What I wonder is what ports do I need to access?
 
  50060 on all nodes.
  50030 on the jobtracker.
 
  Any other ports?
 
  Andreas
 
  Am Mittwoch, den 28.05.2008, 13:37 -0700 schrieb Allen Wittenauer:
  On 5/28/08 1:22 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
   I just wondered what other people use to access the hadoop webservers,
   when running on EC2?
 
  While we don't run on EC2 :), we do protect the hadoop web processes
  by putting a proxy in front of it.  A user connects to the proxy,
  authenticates, and then gets the output from the hadoop process.  All of
  the redirection magic happens via a localhost connection, so no data is
  leaked unprotected.






hadoop on EC2

2008-05-28 Thread Andreas Kostyrka
Hi!

I just wondered what other people use to access the hadoop webservers,
when running on EC2?

Ideas that I had:
1.) opening ports 50030 and so on = not good, data goes unprotected
over the internet. Even if I could enable some form of authentication, it
would still be plain HTTP.

2.) Some kind of tunneling solution. The problem on this side is that
each of my cluster nodes is in a different subnet, plus there is the
dualism between the internal and external addresses of the nodes.

Any hints? TIA,

Andreas




Re: hadoop on EC2

2008-05-28 Thread Jake Thompson
What is wrong with opening up the ports only to the hosts that you want to
have access to them? This is what I am currently doing: -s 0.0.0.0/0 is
everyone everywhere, so change it to -s my.ip.add.ress/32.
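
For example, with the classic EC2 API tools and a made-up group name and
source address (the equivalent iptables rule on the node itself is shown for
comparison):

# allow only one source address to reach the jobtracker UI (EC2 security group)
ec2-authorize my-hadoop-group -P tcp -p 50030 -s 203.0.113.7/32
# or the same idea directly on the node with iptables
iptables -A INPUT -p tcp --dport 50030 -s 203.0.113.7/32 -j ACCEPT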



On Wed, May 28, 2008 at 4:22 PM, Andreas Kostyrka [EMAIL PROTECTED]
wrote:

 Hi!

 I just wondered what other people use to access the hadoop webservers,
 when running on EC2?

 Ideas that I had:
 1.) opening ports 50030 and so on = not good, data goes unprotected
 over the internet. Even if I could enable some form of authentication, it
 would still be plain HTTP.

 2.) Some kind of tunneling solution. The problem on this side is that
 each of my cluster nodes is in a different subnet, plus there is the
 dualism between the internal and external addresses of the nodes.

 Any hints? TIA,

 Andreas



Re: hadoop on EC2

2008-05-28 Thread Allen Wittenauer



On 5/28/08 1:22 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
 I just wondered what other people use to access the hadoop webservers,
 when running on EC2?

While we don't run on EC2 :), we do protect the hadoop web processes by
putting a proxy in front of it.  A user connects to the proxy,
authenticates, and then gets the output from the hadoop process.  All of the
redirection magic happens via a localhost connection, so no data is leaked
unprotected.



Re: hadoop on EC2

2008-05-28 Thread Andreas Kostyrka
That presumes that you have a static source address. Plus for
nontechnical reasons changing the firewall rules is nontrivial.
(I'm responsible for the inside of the VMs, but somebody else holds the
ec2 keys, don't ask)

Andreas

Am Mittwoch, den 28.05.2008, 16:27 -0400 schrieb Jake Thompson:
 What is wrong with opening up the ports only to the hosts that you want to
 have access to them? This is what I am currently doing: -s 0.0.0.0/0 is
 everyone everywhere, so change it to -s my.ip.add.ress/32.
 
 
 
 On Wed, May 28, 2008 at 4:22 PM, Andreas Kostyrka [EMAIL PROTECTED]
 wrote:
 
  Hi!
 
  I just wondered what other people use to access the hadoop webservers,
  when running on EC2?
 
  Ideas that I had:
  1.) opening ports 50030 and so on = not good, data goes unprotected
  over the internet. Even if I could enable some form of authentication, it
  would still be plain HTTP.

  2.) Some kind of tunneling solution. The problem on this side is that
  each of my cluster nodes is in a different subnet, plus there is the
  dualism between the internal and external addresses of the nodes.
 
  Any hints? TIA,
 
  Andreas
 




Re: hadoop on EC2

2008-05-28 Thread Chris Anderson
Andreas,

If you can ssh into the nodes, you can always set up port-forwarding
with ssh -L to bring those ports to your local machine.

On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
 What I wonder is what ports do I need to access?

 50060 on all nodes.
 50030 on the jobtracker.

 Any other ports?

 Andreas

 Am Mittwoch, den 28.05.2008, 13:37 -0700 schrieb Allen Wittenauer:


 On 5/28/08 1:22 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
  I just wondered what other people use to access the hadoop webservers,
  when running on EC2?

 While we don't run on EC2 :), we do protect the hadoop web processes by
 putting a proxy in front of it.  A user connects to the proxy,
 authenticates, and then gets the output from the hadoop process.  All of the
 redirection magic happens via a localhost connection, so no data is leaked
 unprotected.





-- 
Chris Anderson
http://jchris.mfdz.com


Re: hadoop on EC2

2008-05-28 Thread Ted Dunning

That doesn't work because the various web pages have links or redirects to
other pages on other machines.

Also, you would need to ssh to ALL of your cluster to get the file browser
to work.

Better to do the proxy thing.


On 5/28/08 2:16 PM, Chris Anderson [EMAIL PROTECTED] wrote:

 Andreas,
 
 If you can ssh into the nodes, you can always set up port-forwarding
 with ssh -L to bring those ports to your local machine.
 
 On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka [EMAIL PROTECTED]
 wrote:
 What I wonder is what ports do I need to access?
 
 50060 on all nodes.
 50030 on the jobtracker.
 
 Any other ports?
 
 Andreas
 
 Am Mittwoch, den 28.05.2008, 13:37 -0700 schrieb Allen Wittenauer:
 
 
 On 5/28/08 1:22 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
 I just wondered what other people use to access the hadoop webservers,
 when running on EC2?
 
 While we don't run on EC2 :), we do protect the hadoop web processes by
 putting a proxy in front of it.  A user connects to the proxy,
 authenticates, and then gets the output from the hadoop process.  All of the
 redirection magic happens via a localhost connection, so no data is leaked
 unprotected.
 
 
 
 



Re: hadoop on EC2

2008-05-28 Thread Chris Anderson
On Wed, May 28, 2008 at 2:23 PM, Ted Dunning [EMAIL PROTECTED] wrote:

 That doesn't work because the various web pages have links or redirects to
 other pages on other machines.

 Also, you would need to ssh to ALL of your cluster to get the file browser
 to work.

True. That makes it a little impractical.


 Better to do the proxy thing.


This would be a nice addition to the Hadoop EC2 AMI (which is super
helpful, by the way). Thanks to whoever put it together.


-- 
Chris Anderson
http://jchris.mfdz.com


Re: hadoop on EC2

2008-05-28 Thread Jim R. Wilson
Recently I spent some time hacking the contrib/ec2 scripts to install
and configure OpenVPN on top of the other installed packages.  Our use
case required that all the slaves running mappers would need to
connect back through to our primary mysql database (firewalled as you
can imagine).  Simultaneously, our webservers had to be able to
connect to Hbase running atop the same Hadoop cluster via Thrift.

The scheme I eventually settled on was to have a server cert/key and a
client cert/key which would be shared across all the clients - then
make the master node the OpenVPN server, and have all the slave nodes
connect as clients.  Then, if any other box needed access to the
cluster (like our firewalled database and webservers), they'd connect
to the master hadoop node, whose EC2 group had UDP 1194 open to
0.0.0.0.  Such a client could then address any hadoop nodes by their
tunneled vpn IP (10.8.0.x), derived from their AMI instance start ID.
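
A minimal sketch of the moving parts in that scheme, with made-up group and
config file names (the certs/keys and OpenVPN configs are assumed to be baked
into the AMI already):

# open the VPN port on the master's EC2 security group
ec2-authorize hadoop-master -P udp -p 1194 -s 0.0.0.0/0
# on the master node: start the OpenVPN server with the shared server cert/key
openvpn --config /etc/openvpn/server.conf --daemon
# on each slave (and on any firewalled client that needs the cluster):
openvpn --config /etc/openvpn/client.conf --daemon
# clients can then reach the hadoop daemons on the nodes' tunneled 10.8.0.x addresses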

I almost had it all working - the only piece which was giving me
trouble was actually making the slaves connect back to the master at
instance boot time.  I could have figured it out, but got pulled off
because we decided to move away from ec2 for the time being :/

-- Jim R. Wilson (jimbojw)

On Wed, May 28, 2008 at 4:23 PM, Ted Dunning [EMAIL PROTECTED] wrote:

 That doesn't work because the various web pages have links or redirects to
 other pages on other machines.

 Also, you would need to ssh to ALL of your cluster to get the file browser
 to work.

 Better to do the proxy thing.


 On 5/28/08 2:16 PM, Chris Anderson [EMAIL PROTECTED] wrote:

 Andreas,

 If you can ssh into the nodes, you can always set up port-forwarding
 with ssh -L to bring those ports to your local machine.

 On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka [EMAIL PROTECTED]
 wrote:
 What I wonder is what ports do I need to access?

 50060 on all nodes.
 50030 on the jobtracker.

 Any other ports?

 Andreas

 Am Mittwoch, den 28.05.2008, 13:37 -0700 schrieb Allen Wittenauer:


 On 5/28/08 1:22 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
 I just wondered what other people use to access the hadoop webservers,
 when running on EC2?

 While we don't run on EC2 :), we do protect the hadoop web processes by
 putting a proxy in front of it.  A user connects to the proxy,
 authenticates, and then gets the output from the hadoop process.  All of 
 the
 redirection magic happens via a localhost connection, so no data is leaked
 unprotected.








Re: hadoop on EC2

2008-05-28 Thread Nate Carlson

On Wed, 28 May 2008, Andreas Kostyrka wrote:
1.) opening ports 50030 and so on = not good, data goes unprotected
over the internet. Even if I could enable some form of authentication, it
would still be plain HTTP.


Personally, I set up an Apache server (with HTTPS and auth), then set up
CGIProxy[1] on it, and only allowed it to forward to the hadoop nodes. That
way, when you click a link to go to a slave node's logs or whatnot, it still
works.  ;)

I no longer use Hadoop on EC2, but still use the config described above!

[1]http://www.jmarshall.com/tools/cgiproxy/
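
A rough sketch of that kind of front end on a Debian-style Apache, with
made-up paths, names and certificates (CGIProxy's own destination restrictions
are set inside its nph-proxy.cgi script):

# enable TLS, create a user, and drop in a vhost that fronts the CGIProxy script
a2enmod ssl
htpasswd -c /etc/apache2/hadoop.htpasswd hadoopadmin
cat > /etc/apache2/sites-available/hadoop-proxy <<'EOF'
<VirtualHost *:443>
    SSLEngine on
    SSLCertificateFile    /etc/apache2/ssl/proxy.crt
    SSLCertificateKeyFile /etc/apache2/ssl/proxy.key
    ScriptAlias /proxy /usr/lib/cgi-bin/nph-proxy.cgi
    <Location /proxy>
        AuthType Basic
        AuthName "Hadoop web UIs"
        AuthUserFile /etc/apache2/hadoop.htpasswd
        Require valid-user
    </Location>
</VirtualHost>
EOF
a2ensite hadoop-proxy && apache2ctl restart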


| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|



Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Andrey Pankov

Hi,

Did you see the hadoop-0.16.0/src/contrib/ec2/bin/start-hadoop script? It
already contains this part:


echo Copying private key to slaves
for slave in `cat slaves`; do
  scp $SSH_OPTS $PRIVATE_KEY_PATH [EMAIL PROTECTED]:/root/.ssh/id_rsa
  ssh $SSH_OPTS [EMAIL PROTECTED] chmod 600 /root/.ssh/id_rsa
  sleep 1
done

Anyway, did you try the hadoop-ec2 script? It works well for the task you
described.



Prasan Ary wrote:

Hi All,
  I have been trying to configure Hadoop on EC2 for a large cluster (100-plus
machines). It seems that I have to copy the EC2 private key to all the machines
in the cluster so that they can have SSH connections.
  For now it seems I have to run a script to copy the key file to each of the
EC2 instances. I wanted to know if there is a better way to accomplish this.
   
  Thanks,

  PA

   


---
Andrey Pankov


Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Tom White
Yes, this isn't ideal for larger clusters. There's a jira to address
this: https://issues.apache.org/jira/browse/HADOOP-2410.

Tom

On 20/03/2008, Prasan Ary [EMAIL PROTECTED] wrote:
 Hi All,
   I have been trying to configure Hadoop on EC2 for a large cluster (100-plus
 machines). It seems that I have to copy the EC2 private key to all the
 machines in the cluster so that they can have SSH connections.
   For now it seems I have to run a script to copy the key file to each of the
 EC2 instances. I wanted to know if there is a better way to accomplish this.

   Thanks,

   PA





-- 
Blog: http://www.lexemetech.com/


Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Andreas Kostyrka
Actually, I personally use the following 2 part copy technique to copy
files to a cluster of boxes:

tar cf - myfile | dsh -f host-list-file -i -c -M tar xCfv /tmp -

The first tar packages myfile into a tar file.

dsh runs a tar that unpacks the tar (in the above case all boxes listed
in host-list-file would have a /tmp/myfile after the command).

Tar options that are relevant include C (chdir) and v (verbose, can be
given twice) so you see what got copied.

dsh options that are relevant:
-i copy stdin to all ssh processes, requires -c
-c do the ssh calls concurrently.
-M prefix the out from the ssh with the hostname.

While this is not rsync, it has the benefit of being processed
concurrently, and quite flexible.

Andreas

Am Donnerstag, den 20.03.2008, 19:57 +0200 schrieb Andrey Pankov:
 Hi,
 
 Did you see hadoop-0.16.0/src/contrib/ec2/bin/start-hadoop script? It 
 already contains such part:
 
 echo Copying private key to slaves
 for slave in `cat slaves`; do
scp $SSH_OPTS $PRIVATE_KEY_PATH [EMAIL PROTECTED]:/root/.ssh/id_rsa
ssh $SSH_OPTS [EMAIL PROTECTED] chmod 600 /root/.ssh/id_rsa
sleep 1
 done
 
  Anyway, did you try the hadoop-ec2 script? It works well for the task you
  described.
 
 
 Prasan Ary wrote:
  Hi All,
     I have been trying to configure Hadoop on EC2 for a large cluster
   (100-plus machines). It seems that I have to copy the EC2 private key to all
   the machines in the cluster so that they can have SSH connections.
     For now it seems I have to run a script to copy the key file to each of
   the EC2 instances. I wanted to know if there is a better way to accomplish
   this.
 
Thanks,
PA
  
 
 
 ---
 Andrey Pankov




Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Chris K Wensel

you can't do this with the contrib/ec2 scripts/ami.

but passing the master private dns name to the slaves on boot as 'user-data'
works fine. when a slave starts, it contacts the master and joins the cluster.
there isn't any need for a slave to rsync from the master, thus removing the
dependency on them having the private key. and by not using the start|stop-all
scripts, you don't need to maintain the slaves file, and can thus lazily boot
your cluster.


to do this, you will need to create your own AMI that works this way.  
not hard, just time consuming.
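
A minimal sketch of that pattern with the classic EC2 API tools, using a
made-up AMI ID, keypair and master hostname:

# boot the master first, then hand its private DNS name to every slave as user-data
ec2-run-instances ami-xxxxxxxx -k my-keypair \
  -d "MASTER_HOST=ip-10-251-27-12.ec2.internal"
# in the slave AMI's boot script, read it back from the instance metadata service
MASTER_HOST=$(curl -s http://169.254.169.254/latest/user-data | cut -d= -f2)
# then template fs.default.name / mapred.job.tracker in hadoop-site.xml with
# $MASTER_HOST before starting the datanode and tasktracker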


On Mar 20, 2008, at 11:56 AM, Prasan Ary wrote:

Chris,
What do you mean when you say 'boot the slaves with the master private
name'?



 ===

Chris K Wensel [EMAIL PROTECTED] wrote:
 I found it much better to start the master first, then boot the slaves
with the master private name.

i do not use the start|stop-all scripts, so i do not need to maintain
the slaves file. thus i don't need to push private keys around to
support those scripts.

this lets me start 20 nodes, then add 20 more later. or kill some.

btw, get ganglia installed. life will be better knowing what's going  
on.


also, setting up FoxyProxy on Firefox lets you browse your whole
cluster if you set up an ssh tunnel (SOCKS).

On Mar 20, 2008, at 10:15 AM, Prasan Ary wrote:

Hi All,
I have been trying to configure Hadoop on EC2 for a large cluster (100-plus
machines). It seems that I have to copy the EC2 private key to all the
machines in the cluster so that they can have SSH connections.
For now it seems I have to run a script to copy the key file to each of the
EC2 instances. I wanted to know if there is a better way to accomplish this.

Thanks,
PA




Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/








Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/