[CONF] Apache Lucene Mahout > canopy-commandline

confluence Fri, 04 Jun 2010 08:42:27 -0700

Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: canopy-commandline 
(http://cwiki.apache.org/confluence/display/MAHOUT/canopy-commandline)



Edited by Jeff Eastman:
---------------------------------------------------------------------
h1. Running Canopy Clustering from the Command Line
Mahout's Canopy clustering is launched with the same command line invocation 
whether you are running on a single machine in stand-alone mode or on a larger 
Hadoop cluster. The mode is determined by the $HADOOP_HOME and 
$HADOOP_CONF_DIR environment variables. If both point to an operating Hadoop 
cluster on the target machine, the invocation runs Canopy on that cluster. If 
either environment variable is missing, the stand-alone Hadoop configuration 
is used instead.

{code}
./bin/mahout canopy <OPTIONS>
{code}
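
Before launching, it can help to confirm which mode the environment will 
select. The following is a minimal sketch of the rule described above, not 
the actual logic inside bin/mahout. The function name mahout_mode is 
hypothetical, and the check only tests that both variables are non-empty; it 
does not verify that the cluster they point to is operating.

```shell
#!/bin/sh
# Sketch only: mirrors the documented rule, not the real bin/mahout script.
# mahout_mode is a hypothetical helper name.
mahout_mode() {
  # Both variables must be non-empty for the cluster configuration to apply.
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    # Either variable missing: the stand-alone configuration is used.
    echo "stand-alone"
  fi
}

echo "Canopy would run in $(mahout_mode) mode"
```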

* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job 
will be generated in $MAHOUT_HOME/core/target/ and its name will contain the 
Mahout version number. For example, when using the Mahout 0.3 release, the job 
will be mahout-core-0.3.job
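
To confirm that the build produced the job file, something like the 
following sketch may be useful. The helper name find_mahout_job is 
hypothetical, and the glob assumes the mahout-core-<version>.job naming 
shown in the example above; other releases may name the file differently.

```shell
#!/bin/sh
# Sketch only: find_mahout_job is a hypothetical helper name.
# Lists job files under <dir>/core/target, where <dir> is the first
# argument, else $MAHOUT_HOME, else the current directory.
find_mahout_job() {
  ls "${1:-${MAHOUT_HOME:-.}}"/core/target/mahout-core-*.job 2>/dev/null
}

# Print the job path, or a hint if nothing has been built yet.
find_mahout_job || echo "no job file found; run mvn install first"
```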


h2. Testing it on a single machine without a cluster

* Put the data: cp <PATH TO DATA> testdata
* Run the Job: 
{code}
./bin/mahout canopy -i testdata -o output \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
{code}

h2. Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job: 
{code}
export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout canopy -i testdata -o output \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
{code}
* Get the data out of HDFS and have a look. Use 
$HADOOP_HOME/bin/hadoop fs -lsr output to list all output files.

h1. Command line options
{code}
  --input (-i) input                         Path to job input directory. Must  
                                             be a SequenceFile of               
                                             VectorWritable                     
  --output (-o) output                       The directory pathname for output. 
  --overwrite (-ow)                          If present, overwrite the output   
                                             directory before running job       
  --distanceMeasure (-dm) distanceMeasure    The classname of the               
                                             DistanceMeasure. Default is        
                                             SquaredEuclidean                   
  --t1 (-t1) t1                              T1 threshold value                 
  --t2 (-t2) t2                              T2 threshold value                 
  --clustering (-cl)                         If present, run clustering after   
                                             the iterations have taken place    
  --help (-h)                                Print out help                     
{code}
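
The options above can be assembled into a command line by a small wrapper. 
The sketch below is one way to do that, under stated assumptions: the 
function name canopy_cmd is hypothetical, it only prints the command rather 
than running it, and the T1 > T2 check reflects how canopy clustering is 
usually defined (T1 as the loose threshold, T2 as the tight one) rather than 
any validation the mahout script is known to perform.

```shell
#!/bin/sh
# Sketch only: canopy_cmd is a hypothetical helper that prints a canopy
# invocation built from the documented options. It does not run Mahout.
canopy_cmd() {
  input=$1; output=$2; measure=$3; t1=$4; t2=$5
  # Assumption: require T1 > T2, as in the usual canopy definition.
  # awk handles the numeric comparison so fractional thresholds work too.
  if ! awk -v a="$t1" -v b="$t2" 'BEGIN { exit !(a + 0 > b + 0) }'; then
    echo "t1 ($t1) must be greater than t2 ($t2)" >&2
    return 1
  fi
  echo "./bin/mahout canopy -i $input -o $output -dm $measure -ow -t1 $t1 -t2 $t2"
}

# Reproduces the example invocation used earlier on this page.
canopy_cmd testdata output \
  org.apache.mahout.common.distance.CosineDistanceMeasure 5 2
```

Printing the command instead of executing it keeps the sketch safe to try on 
a machine without Mahout installed; the printed line can be reviewed and then 
run by hand.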
