Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: lda-commandline
(http://cwiki.apache.org/confluence/display/MAHOUT/lda-commandline)
Added by Jeff Eastman:
---------------------------------------------------------------------
h1. Running Latent Dirichlet Allocation from the Command Line
Mahout's LDA is launched with the same command line invocation whether you
are running on a single machine in stand-alone mode or on a larger Hadoop
cluster. The mode is determined by the $HADOOP_HOME and $HADOOP_CONF_DIR
environment variables. If both point to a working Hadoop installation on the
target machine, the invocation runs LDA on that cluster. If either
environment variable is missing, the stand-alone Hadoop configuration is
used instead.
{code}
./bin/mahout lda <OPTIONS>
{code}
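The rule above can be checked before launching a job. The following is a minimal sketch, assuming only what the paragraph states (both variables set means cluster mode); mahout_mode is a hypothetical helper and not part of Mahout, and it does not verify that the cluster is actually reachable.

```shell
# Sketch: report which mode the launcher is expected to pick, based on
# the documented rule. mahout_mode is a hypothetical helper name.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

mahout_mode
```

Running it in the shell you plan to launch from shows whether exported variables left over from an earlier session would send the job to a cluster.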
* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will
be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout
version number. For example, when using the Mahout 0.3 release, the job will be
mahout-core-0.3.job.
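To confirm that the build produced the job file, something like the following may help. It is a sketch that assumes the core/target location and the mahout-core-<version>.job naming described above; find_mahout_job is a hypothetical helper, not a Mahout command.

```shell
# Sketch: print the first job file found under core/target, if any.
# find_mahout_job is a hypothetical helper name.
find_mahout_job() {
  # $1: Mahout home directory (falls back to $MAHOUT_HOME)
  ls "${1:-$MAHOUT_HOME}"/core/target/mahout-core-*.job 2>/dev/null | head -n 1
}
```

Calling find_mahout_job "$MAHOUT_HOME" after mvn install prints the path of the job file, or nothing if the build did not produce one.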
h2. Testing it on a single machine without a cluster
* Put the data: cp <PATH TO DATA> testdata
* Run the Job:
{code}
./bin/mahout lda -i testdata <OTHER OPTIONS>
{code}
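As an illustration of what <OTHER OPTIONS> might look like, the sketch below assembles a full command from the flags listed under "Command line options". The numeric values and the output directory name are hypothetical placeholders, not recommendations, and lda_command is a hypothetical helper; the command is printed rather than executed so it can be reviewed first.

```shell
# Hypothetical example values; choose them to suit your corpus.
NUM_TOPICS=20     # -k: number of topics
NUM_WORDS=50000   # -v: must exceed the actual vocabulary size
MAX_ITER=40       # -x: maximum number of iterations

# Print the command instead of running it.
lda_command() {
  echo "./bin/mahout lda -i testdata -o output -k $NUM_TOPICS -v $NUM_WORDS -x $MAX_ITER -ow"
}

lda_command
```

Once the printed line looks right, it can be pasted into a shell opened in $MAHOUT_HOME.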
h2. Running it on the cluster
* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job:
{code}
export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout lda -i testdata <OTHER OPTIONS>
{code}
* Get the data out of HDFS and have a look. Use $HADOOP_HOME/bin/hadoop fs -lsr
output to list all output files.
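The last step can be sketched as a small function. This assumes the job wrote to an HDFS directory named output, as in the listing command above; fetch_lda_output and the local directory name lda-output are hypothetical. It uses only the standard hadoop fs -lsr and hadoop fs -get subcommands.

```shell
# Sketch: list the job output in HDFS, then copy it to a local directory.
# fetch_lda_output is a hypothetical helper name.
fetch_lda_output() {
  hadoop_bin="${HADOOP_HOME:?HADOOP_HOME must be set}/bin/hadoop"
  # $1: local destination directory (defaults to lda-output)
  "$hadoop_bin" fs -lsr output &&
    "$hadoop_bin" fs -get output "${1:-lda-output}"
}
```

The copy only runs if the listing succeeds, so a missing output directory is reported once rather than twice.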
h1. Command line options
{code}
--input (-i) input                      Path to job input directory. Must be
                                        a SequenceFile of VectorWritable
--output (-o) output                    The directory pathname for output.
--numTopics (-k) numTopics              The total number of topics in the
                                        corpus
--numWords (-v) numWords                The total number of words in the
                                        corpus (can be approximate, needs to
                                        exceed the actual value)
--topicSmoothing (-a) topicSmoothing    Topic smoothing parameter. Default is
                                        50/numTopics.
--maxIter (-x) maxIter                  The maximum number of iterations.
--maxRed (-r) maxRed                    The number of reduce tasks. Defaults
                                        to 2
--overwrite (-ow)                       If present, overwrite the output
                                        directory before running job
--help (-h)                             Print out help
{code}
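The documented default for topic smoothing is 50/numTopics, which is not always a whole number. The sketch below simply evaluates that formula for a given number of topics, which can be handy when choosing an explicit -a value near the default; default_topic_smoothing is a hypothetical helper, not a Mahout command.

```shell
# Sketch: evaluate the documented default, 50/numTopics.
# default_topic_smoothing is a hypothetical helper name.
default_topic_smoothing() {
  # $1: number of topics (the value passed to -k)
  awk -v k="$1" 'BEGIN { print 50 / k }'
}

default_topic_smoothing 20   # prints 2.5
```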