[CONF] Apache Lucene Mahout > lda-commandline

confluence Fri, 04 Jun 2010 09:06:22 -0700

Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: lda-commandline 
(http://cwiki.apache.org/confluence/display/MAHOUT/lda-commandline)


Added by Jeff Eastman:
---------------------------------------------------------------------
h1. Running Latent Dirichlet Allocation from the Command Line
Mahout's LDA is launched with the same command line invocation whether you 
are running on a single machine in stand-alone mode or on a larger Hadoop 
cluster. The mode is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR 
environment variables. If both are set and point to an operating Hadoop 
cluster on the target machine, the invocation runs LDA on that cluster. If 
either variable is missing, the stand-alone Hadoop configuration is used 
instead.

{code}
./bin/mahout lda <OPTIONS>
{code}
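
The rule described above can be checked before launching a job. The following is a minimal sketch that only mirrors the description on this page. The function name mahout_mode is hypothetical and not part of Mahout, and the real bin/mahout script may apply additional checks.

```shell
# Hypothetical helper, not part of Mahout: reports which mode the
# description above implies for the current environment.
mahout_mode() {
  # Cluster mode needs BOTH variables to be set and non-empty.
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

echo "bin/mahout would run in $(mahout_mode) mode"
```

Running this in the shell that will launch the job shows whether the exports are visible to that shell.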

* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job 
will be generated in $MAHOUT_HOME/core/target/ and its name will contain the 
Mahout version number. For example, when using the Mahout 0.3 release, the job 
will be mahout-core-0.3.job
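
To confirm that the build produced a job file, a small check along these lines may help. It is a sketch only: the function name find_mahout_job is hypothetical, and the glob mahout-core-*.job is an assumption inferred from the 0.3 example above.

```shell
# Hypothetical helper: print the first job file found under
# <mahout home>/core/target, or return non-zero if there is none.
find_mahout_job() {
  for f in "$1"/core/target/mahout-core-*.job; do
    # An unmatched glob stays literal, so test that the file exists.
    [ -e "$f" ] && { echo "$f"; return 0; }
  done
  return 1
}

find_mahout_job "${MAHOUT_HOME:-.}" || echo "no job file found; run mvn install first"
```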


h2. Testing it on a single machine without a cluster

* Put the data: cp <PATH TO DATA> testdata
* Run the Job: 
{code}
./bin/mahout lda -i testdata <OTHER OPTIONS>
{code}

h2. Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job: 
{code}
export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout lda -i testdata <OTHER OPTIONS>
{code}
* Get the data out of HDFS and have a look. Use $HADOOP_HOME/bin/hadoop fs 
-lsr output to view all outputs.
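
The cluster steps above can be collected into one script. The sketch below is a dry run by default, so it only prints the commands it would execute. The names run and cluster_lda are hypothetical, the use of -o output is an assumption chosen to match the fs -lsr output step, and a real job will likely need further options such as -k and -v.

```shell
# Dry run by default: set DRY_RUN=0 to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the steps listed above.
# $1: local path to the input data
cluster_lda() {
  run "$HADOOP_HOME/bin/hadoop" fs -put "$1" testdata
  run ./bin/mahout lda -i testdata -o output -ow
  run "$HADOOP_HOME/bin/hadoop" fs -lsr output
}
```

With HADOOP_HOME and HADOOP_CONF_DIR exported as shown earlier, calling cluster_lda with the path to the data prints the three commands in order. Setting DRY_RUN=0 runs them from $MAHOUT_HOME.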

h1. Command line options
{code}
  --input (-i) input                      Path to job input directory. Must be  
                                          a SequenceFile of VectorWritable      
  --output (-o) output                    The directory pathname for output.    
  --numTopics (-k) numTopics              The total number of topics in the     
                                          corpus                                
  --numWords (-v) numWords                The total number of words in the      
                                          corpus (can be approximate, needs to  
                                          exceed the actual value)              
  --topicSmoothing (-a) topicSmoothing    Topic smoothing parameter. Default is 
                                          50/numTopics.                         
  --maxIter (-x) maxIter                  The maximum number of iterations.     
  --maxRed (-r) maxRed                    The number of reduce tasks. Defaults  
                                          to 2                                  
  --overwrite (-ow)                       If present, overwrite the output      
                                          directory before running job          
  --help (-h)                             Print out help                        
{code}
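
As an illustration of how these options combine, the sketch below assembles a full invocation and prints it without running it. All numeric values are made-up examples, not recommendations from this page. The smoothing value is computed from the documented default of 50/numTopics, so passing it explicitly here is redundant and only shows where the option goes.

```shell
# Illustrative values only; none of these come from the page.
NUM_TOPICS=20
NUM_WORDS=50000   # per the help text, must exceed the actual word count
MAX_ITER=40

# Default topic smoothing per the help text: 50/numTopics.
TOPIC_SMOOTHING=$(awk -v k="$NUM_TOPICS" 'BEGIN { print 50 / k }')

# Build the command line from the options listed above and print it.
LDA_CMD="./bin/mahout lda -i testdata -o output -k $NUM_TOPICS -v $NUM_WORDS -a $TOPIC_SMOOTHING -x $MAX_ITER -ow"
echo "$LDA_CMD"
```

Omitting -r leaves the number of reduce tasks at its documented default of 2.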
