Re: Hadoop on EC2 + S3 - best practice?

2008-07-01 Thread tim robertson
Hi Tom, Thanks for the reply, and after posting I found your blogs and followed your instructions - thanks. There were a couple of gotchas: 1) My secret key had a / in it and the escaping does not work. 2) I copied to the root directory in the S3 bucket and I could not manage to get it out again using a dist
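
The "/" gotcha above is typically hit when the AWS secret key is embedded in the s3:// URL. One way to sidestep the escaping problem entirely (a sketch with placeholder values, using the property names Hadoop's S3 filesystem of that era reads) is to put the credentials in hadoop-site.xml instead of the URL:

  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_AWS_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
  </property>

With the keys in the configuration, paths can be written as plain s3://bucket/path with no credentials (and no escaping) in the URL.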

Re: Hadoop on EC2 + S3 - best practice?

2008-07-01 Thread Tom White
Hi Tim, The steps you outline look about right. Because your file is >5GB you will need to use the S3 block file system, which has an s3 URL. (See http://wiki.apache.org/hadoop/AmazonS3) You shouldn't have to build your own AMI unless you have dependencies that can't be submitted as a part of the M
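
For illustration, a rough sketch of moving data through the S3 block filesystem Tom describes (bucket and paths are hypothetical; credentials are assumed to be set in the configuration as in the note above). The block filesystem stores files in its own block format, so data needs to be written through Hadoop rather than uploaded as raw S3 objects:

  # from any machine with Hadoop and the S3 credentials configured
  bin/hadoop fs -put big-input.txt s3://my-bucket/input/big-input.txt

  # on the EC2 cluster, pull the file into HDFS before running the job
  bin/hadoop distcp s3://my-bucket/input/big-input.txt /user/root/input/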

Hadoop on EC2 + S3 - best practice?

2008-06-28 Thread tim robertson
Hi all, I have data in a file (150 million lines, 100 GB or so) and have several MapReduce classes for my processing (custom index generation). Can someone please confirm the following is the best way to run on EC2 and S3 (both of which I am new to..) 1) load my 100 GB file into S3 2) create a cla
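
As a sketch of the EC2 side of the workflow being asked about (the subcommand names follow the contrib/ec2 scripts shipped around 0.17.0 and may differ in other versions; cluster name, size, jar and class names are made up):

  # from the local machine: launch a cluster, then log in to the master
  bin/hadoop-ec2 launch-cluster my-cluster 20
  bin/hadoop-ec2 login my-cluster

  # on the master: pull the input from S3 into HDFS, run the job,
  # and push the results back out to S3
  hadoop distcp s3://my-bucket/input /user/root/input
  hadoop jar my-indexer.jar com.example.IndexDriver /user/root/input /user/root/output
  hadoop distcp /user/root/output s3://my-bucket/output

  # back on the local machine, once the output is safely in S3
  bin/hadoop-ec2 terminate-cluster my-cluster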

Re: hadoop on EC2

2008-06-04 Thread Chris K Wensel
These are the FoxyProxy wildcards I use: *compute-1.amazonaws.com* *.ec2.internal* *.compute-1.internal* And w/ hadoop 0.17.0, just type (after booting your cluster) hadoop-ec2 proxy to start the tunnel for that cluster. On Jun 3, 2008, at 11:26 PM, James Moore wrote: On Tue, Jun 3, 2008 at

Re: hadoop on EC2

2008-06-04 Thread Steve Loughran
Andreas Kostyrka wrote: Well, the basic "trouble" with EC2 is that clusters usually are not networks in the TCP/IP sense. This makes it painful to decide which URLs should be resolved where. Plus to make it even more painful, you cannot easily run it with one simple SOCKS server, because you

Re: hadoop on EC2

2008-06-03 Thread James Moore
On Tue, Jun 3, 2008 at 5:04 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > Plus to make it even more painful, you cannot easily run it with one simple > SOCKS server, because you need to defer DNS resolution to the inside of the > cluster, because VM names do resolve to external IPs, while the webs

Re: hadoop on EC2

2008-06-03 Thread Andreas Kostyrka
Well, the basic "trouble" with EC2 is that clusters usually are not networks in the TCP/IP sense. This makes it painful to decide which URLs should be resolved where. Plus to make it even more painful, you cannot easily run it with one simple SOCKS server, because you need to defer DNS resoluti
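
The usual way around the DNS problem Andreas describes is a dynamic (SOCKS) tunnel combined with the browser's remote-DNS option, so both the connections and the name lookups happen from inside EC2. A minimal sketch, with a hypothetical master hostname and an arbitrary local port:

  # SOCKS proxy on localhost:6666; hostnames passed to it are resolved
  # on the EC2 side of the tunnel, so *.ec2.internal names work
  ssh -D 6666 root@ec2-67-202-xx-xx.compute-1.amazonaws.com

In Firefox this means pointing the SOCKS settings (or a FoxyProxy rule) at localhost:6666 and enabling network.proxy.socks_remote_dns so lookups are deferred to the proxy.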

Re: hadoop on EC2

2008-06-02 Thread Chris K Wensel
obviously this isn't the best solution if you need to let many semi-trusted users browse your cluster. Actually, it would be much more secure if the tunnel service ran on a trusted server, letting your users connect remotely via SOCKS and then browse the cluster. These users wouldn't need

Re: hadoop on EC2

2008-06-02 Thread Chris K Wensel
if you use the new scripts in 0.17.0, just run > hadoop-ec2 proxy. This starts an ssh tunnel to your cluster. Installing FoxyProxy in FF gives you whole-cluster visibility. Obviously this isn't the best solution if you need to let many semi-trusted users browse your cluster. On May 28, 20

Re: hadoop on EC2

2008-05-30 Thread Andreas Kostyrka
On Wednesday 28 May 2008 23:16:43 Chris Anderson wrote: > Andreas, > > If you can ssh into the nodes, you can always set up port-forwarding > with ssh -L to bring those ports to your local machine. Yes, and the missing part is simple too: iptables with DNAT on OUTPUT :) I even made a small ugly s
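
A sketch of what the iptables/DNAT trick might look like (the internal address is made up, and it assumes the jobtracker UI has already been brought to localhost:50030 with ssh -L as Chris suggests):

  # locally-generated traffic aimed at the node's internal address gets
  # rewritten onto the ssh tunnel listening on localhost
  iptables -t nat -A OUTPUT -p tcp -d 10.252.27.4 --dport 50030 \
           -j DNAT --to-destination 127.0.0.1:50030

One such rule per node and port makes the links inside the web UI land on something reachable.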

Re: hadoop on EC2

2008-05-28 Thread Nate Carlson
up cgiproxy[1] on it, and only allowed it to forward to the hadoop nodes. That way, when you click a link to go to a slave node's logs or whatnot it still works. ;) I no longer use Hadoop on ec2, but still use the config described above! [1]http://www.jmarshall.com/tools/

Re: hadoop on EC2

2008-05-28 Thread Jim R. Wilson
Recently I spent some time hacking the contrib/ec2 scripts to install and configure OpenVPN on top of the other installed packages. Our use case required that all the slaves running mappers be able to connect back through to our primary mysql database (firewalled, as you can imagine). Simultane

Re: hadoop on EC2

2008-05-28 Thread Chris Anderson
On Wed, May 28, 2008 at 2:23 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > That doesn't work because the various web pages have links or redirects to > other pages on other machines. > > Also, you would need to ssh to ALL of your cluster to get the file browser > to work. True. That makes it a li

Re: hadoop on EC2

2008-05-28 Thread Ted Dunning
That doesn't work because the various web pages have links or redirects to other pages on other machines. Also, you would need to ssh to ALL of your cluster to get the file browser to work. Better to do the proxy thing. On 5/28/08 2:16 PM, "Chris Anderson" <[EMAIL PROTECTED]> wrote: > Andreas

Re: hadoop on EC2

2008-05-28 Thread Chris Anderson
Andreas, If you can ssh into the nodes, you can always set up port-forwarding with ssh -L to bring those ports to your local machine. On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > What I wonder is what ports do I need to access? > > 50060 on all nodes. > 50030 on
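
A sketch of the ssh -L approach (key path and hostnames are hypothetical; the second forward reaches a slave's tasktracker UI through the master, since -L destinations are resolved on the remote end):

  ssh -i ~/.ssh/my-ec2-keypair \
      -L 50030:localhost:50030 \
      -L 50061:domU-12-31-xx-xx.compute-1.internal:50060 \
      root@ec2-67-202-xx-xx.compute-1.amazonaws.com

As Ted points out in his reply above, the links in the pages go to other machines, so each node you want to browse needs its own forwarded port.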

Re: hadoop on EC2

2008-05-28 Thread Andreas Kostyrka
What I wonder is what ports do I need to access? 50060 on all nodes. 50030 on the jobtracker. Any other ports? Andreas On Wednesday, 2008-05-28 at 13:37 -0700, Allen Wittenauer wrote: > > > On 5/28/08 1:22 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: > > I just wondered what other peop
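
For reference, the web UI ports of that Hadoop vintage defaulted to the following (all overridable in the configuration):

  50030  jobtracker
  50060  tasktracker (one per slave)
  50070  namenode
  50075  datanode (one per slave)
  50090  secondary namenode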

Re: hadoop on EC2

2008-05-28 Thread Andreas Kostyrka
That presumes that you have a static source address. Plus, for nontechnical reasons, changing the firewall rules is nontrivial. (I'm responsible for the inside of the VMs, but somebody else holds the ec2 keys, don't ask) Andreas On Wednesday, 2008-05-28 at 16:27 -0400, Jake Thompson wrote: > What

Re: hadoop on EC2

2008-05-28 Thread Allen Wittenauer
On 5/28/08 1:22 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: > I just wondered what other people use to access the hadoop webservers, > when running on EC2? While we don't run on EC2 :), we do protect the hadoop web processes by putting a proxy in front of them. A user connects to the p

Re: hadoop on EC2

2008-05-28 Thread Jake Thompson
What is wrong with opening up the ports only to the hosts that you want to have access to them? This is what I am currently doing. -s 0.0.0.0/0 is everyone everywhere, so change it to -s my.ip.add.ress/32 On Wed, May 28, 2008 at 4:22 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > Hi! > > I j
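
With the classic ec2-api-tools that the -s syntax above comes from, tightening a security group might look like this (group name and source address are hypothetical):

  # drop the wide-open rule and re-allow the UI port from one workstation only
  ec2-revoke    my-hadoop-group -P tcp -p 50030 -s 0.0.0.0/0
  ec2-authorize my-hadoop-group -P tcp -p 50030 -s 203.0.113.7/32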

hadoop on EC2

2008-05-28 Thread Andreas Kostyrka
Hi! I just wondered what other people use to access the hadoop webservers, when running on EC2? Ideas that I had: 1.) opening ports 50030 and so on => not good, data goes unprotected over the internet. Even if I could enable some form of authentication it would still be plain http. 2.) Some kind of

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Chris K Wensel
ll be better knowing what's going on. also, setting up FoxyProxy on firefox lets you browse your whole cluster if you setup a ssh tunnel (socks). On Mar 20, 2008, at 10:15 AM, Prasan Ary wrote: Hi All, I have been trying to configure Hadoop on EC2 for large number of clusters ( 100 plus).

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Prasan Ary
oing on. also, setting up FoxyProxy on firefox lets you browse your whole cluster if you setup a ssh tunnel (socks). On Mar 20, 2008, at 10:15 AM, Prasan Ary wrote: > Hi All, > I have been trying to configure Hadoop on EC2 for large number of > clusters ( 100 plus). It seems that I have to

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Andreas Kostyrka
id_rsa" >sleep 1 > done > > Anyway, did you tried hadoop-ec2 script? It works well for task you > described. > > > Prasan Ary wrote: > > Hi All, > > I have been trying to configure Hadoop on EC2 for large number of > > clusters ( 10

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Tom White
Yes, this isn't ideal for larger clusters. There's a jira to address this: https://issues.apache.org/jira/browse/HADOOP-2410. Tom On 20/03/2008, Prasan Ary <[EMAIL PROTECTED]> wrote: > Hi All, > I have been trying to configure Hadoop on EC2 for large number of clusters &

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Chris K Wensel
have been trying to configure Hadoop on EC2 for large number of clusters ( 100 plus). It seems that I have to copy EC2 private key to all the machines in the cluster so that they can have SSH connections. For now it seems I have to run a script to copy the key file to each of the EC2 instances

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Andrey Pankov
TECTED]" "chmod 600 /root/.ssh/id_rsa" sleep 1 done Anyway, did you tried hadoop-ec2 script? It works well for task you described. Prasan Ary wrote: Hi All, I have been trying to configure Hadoop on EC2 for large number of clusters ( 100 plus). It seems that I have to cop

Hadoop on EC2 for large cluster

2008-03-20 Thread Prasan Ary
Hi All, I have been trying to configure Hadoop on EC2 for a large cluster (100-plus nodes). It seems that I have to copy the EC2 private key to all the machines in the cluster so that they can have SSH connections. For now it seems I have to run a script to copy the key file to each of the
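
A sketch of the kind of copy loop being described, fragments of which show up in the replies above (the host list, key name and user are hypothetical):

  # push the cluster keypair to every instance so the nodes can ssh to each other
  for host in $(cat ec2-hosts.txt); do
    scp -o StrictHostKeyChecking=no -i ~/.ssh/ec2-keypair \
        ~/.ssh/ec2-keypair root@$host:/root/.ssh/id_rsa
    ssh -o StrictHostKeyChecking=no -i ~/.ssh/ec2-keypair \
        root@$host "chmod 600 /root/.ssh/id_rsa"
    sleep 1
  done

As Tom notes above, HADOOP-2410 tracks making this less painful for large clusters, and Andrey's suggestion of the contrib hadoop-ec2 script covers the common case.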