[CONF] Apache Lucene Mahout > MahoutEC2

confluence Wed, 12 May 2010 13:15:23 -0700

Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: MahoutEC2 (http://cwiki.apache.org/confluence/display/MAHOUT/MahoutEC2)



Edited by Jeff Eastman:
---------------------------------------------------------------------
h1. Mahout on Amazon EC2

Amazon EC2 is a compute-on-demand platform sold by Amazon.com that allows users 
to purchase one or more host machines on an hourly basis and execute 
applications.  Since Hadoop can run on EC2, it is also possible to run Mahout 
on EC2.  The following sections will detail how to do this.

  
h1. Prerequisites

To run Mahout on EC2 you need to start up a Hadoop cluster on one or more 
instances of a Hadoop-0.20.2 compatible Amazon Machine Image (AMI). 
Unfortunately, no public AMIs currently support Hadoop-0.20.2, so you will 
have to create one. The following steps begin with a public Cloudera Ubuntu 
AMI that comes with Java installed. You could use any other AMI with Java 
installed, or you could use a clean AMI and install Java yourself.

# From the [AWS Management Console|https://console.aws.amazon.com/ec2/home#c=EC2&s=Home]/AMIs, start the following AMI (_ami-8759bfee_)
{code}
cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-x86_64.manifest.xml 
{code}
# From the AWS Console/Instances, select the instance and right-click 'Connect' to get the connect string, which contains your <instance public DNS name>
{code}
> ssh -i <gsg-keypair.pem> root@<instance public DNS name>
{code}
# In the root home directory, run:
{code}
# apt-get install python-setuptools
# easy_install "simplejson==2.0.9"
# easy_install "boto==1.8d"
# apt-get install ant
# apt-get install subversion
# apt-get install maven2
{code}
# Add the following to your .profile
{code}
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export MAHOUT_HOME=~/mahout
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
{code}
# Upload the Hadoop distribution and configure it. *TODO* This distribution is not available on the Hadoop site. Where did we get it from?
{code}
> scp -i <gsg-keypair.pem> <where>/hadoop-0.20.2.tar.gz root@<instance public DNS name>:.

# tar -xzf hadoop-0.20.2.tar.gz
# mv hadoop-0.20.2 /usr/local/.
{code}
# Configure Hadoop for temporary single node operation
## Add the following to $HADOOP_HOME/conf/hadoop-env.sh
{code}
# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
{code}
## Add the following to $HADOOP_HOME/conf/core-site.xml
{code}
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
        <!-- set to 1 to reduce warnings when 
        running on a single node -->
  </property>
</configuration>
{code}
## Add the following to $HADOOP_HOME/conf/mapred-site.xml
{code}
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
        <!-- set to 1 to reduce warnings when 
        running on a single node -->
  </property>
</configuration>
{code}
## Set up authorized keys for passwordless localhost login and format your name node
{code}
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# $HADOOP_HOME/bin/hadoop namenode -format
{code}
# Check out and build Mahout
{code}
# svn co http://svn.apache.org/repos/asf/mahout/trunk mahout 
# cd mahout
# mvn install
{code}
# Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, shut it down.
{code}
# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps     // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
# $MAHOUT_HOME/examples/bin/build-reuters.sh

# $HADOOP_HOME/bin/stop-all.sh
{code}
# Now convert the running instance into a new AMI
*TODO*
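
The jps check in the steps above can be scripted rather than read by eye. The following is only a sketch: the function name missing_daemons and the variable EXPECTED_DAEMONS are hypothetical helpers, not part of Hadoop or Mahout, and the code assumes jps prints one "<pid> <ClassName>" pair per line.

```shell
#!/bin/sh
# Sketch only: missing_daemons and EXPECTED_DAEMONS are hypothetical helpers,
# not part of Hadoop or Mahout.

# The five daemons the steps above expect jps to list on a single node.
EXPECTED_DAEMONS="NameNode SecondaryNameNode DataNode JobTracker TaskTracker"

# Read jps-style output ("<pid> <ClassName>" per line) on stdin and print
# the name of every expected daemon that does not appear in it.
missing_daemons() {
    # Keep only the class-name column of the jps output.
    running=$(awk '{print $2}')
    for daemon in $EXPECTED_DAEMONS; do
        # -x matches whole lines, so "NameNode" does not match "SecondaryNameNode".
        printf '%s\n' "$running" | grep -qx "$daemon" || printf '%s\n' "$daemon"
    done
}
```

After start-all.sh, running "jps | missing_daemons" should print nothing when all five daemons are up; any name it prints is a daemon that failed to start.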

h1. Getting Started

*TODO*

h1. Running the Examples

*TODO*

h1. References

[Hadoop's instructions|http://wiki.apache.org/hadoop/AmazonEC2]


h1. Recognition

Some of the information available here was made possible through the "Amazon 
Web Services Apache Projects Testing Program".
