Hey all,

  Just an update on the new-and-improved command-line "UI" we have now.
After a ton of iterations back and forth with Drew (thanks!), MAHOUT-301
has been committed.  It makes it easy to trim down your long, long command
lines for most of our *Driver main() methods: you save your default
command-line arguments for the various drivers in properties files (which
can then be overridden on the command line), and this works whether you run
locally or on Hadoop.  The feature set is as follows (usage after that):

  Whether you work from the binary distribution or from source (after having
done "mvn install", naturally), the setup is the same: there are a bunch of
properties files with a kludgey format (I didn't want to dig into the XML
rathole, and while a flexible schema would be nice, I opted to follow the
YAGNI principle):

  *) there is a new directory "conf" at the top level (of the binary dist,
as well as source), which contains a bunch of *.props files.  One special
one, driver.classes.props, maps the fully-qualified name of a class which
has a main() method (the key) to a "short-name" and brief description (the
value).  The current file is just the following:

###
org.apache.mahout.utils.vectors.VectorDumper = vectordump : Dump vectors from a sequence file to text
org.apache.mahout.utils.clustering.ClusterDumper = clusterdump : Dump cluster output to text
org.apache.mahout.utils.SequenceFileDumper = seqdumper : Generic Sequence File dumper
org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering
org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver = fkmeans : Fuzzy K-means clustering
org.apache.mahout.clustering.lda.LDADriver = lda : Latent Dirchlet Allocation
org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver = fpg : Frequent Pattern Growth
org.apache.mahout.clustering.dirichlet.DirichletDriver = dirichlet : Dirichlet Clustering
org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver = meanshift : Mean Shift clustering
org.apache.mahout.clustering.canopy.CanopyDriver = canopy : Canopy clustering
org.apache.mahout.utils.vectors.lucene.Driver = lucene.vector : Generate Vectors from a Lucene index
org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory
org.apache.mahout.text.SparseVectorsFromSequenceFiles = seq2sparse: Sparse Vector generation from Text sequence files
org.apache.mahout.text.WikipediaToSequenceFile = seqwiki : Wikipedia xml dump to sequence file
org.apache.mahout.classifier.bayes.TestClassifier = testclassifier : Test Bayes Classifier
org.apache.mahout.classifier.bayes.TrainClassifier = trainclassifier : Train Bayes Classifier
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd : Lanczos Singular Value Decomposition
org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob = cleansvd : Cleanup and verification of SVD output
###

It's meant to be read into java.util.Properties, where the value on the
right-hand side is further split on ":" into the short-name (to be used on
the command line) and the description (printed to stdout if the input is
invalid, or if "-h" is used with no class name to run).  *If there are
missing classes from this list, please add them!*
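
For anyone who wants to see the parsing spelled out, here is a rough sketch
of the idea.  It is not the actual MahoutDriver code: the class name
DriverClassesSketch and both method names are made up for illustration, and
it assumes the value is split on the first ":" only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Illustrative only: DriverClassesSketch is a made-up name, not MahoutDriver.
final class DriverClassesSketch {

  /** Splits "shortName : description" on the first ':' and trims both parts. */
  static String[] splitValue(String value) {
    int colon = value.indexOf(':');
    if (colon < 0) {
      // No description given: treat the whole value as the short-name.
      return new String[] {value.trim(), ""};
    }
    return new String[] {
        value.substring(0, colon).trim(), value.substring(colon + 1).trim()};
  }

  /** Parses the file contents and returns a map of short-name -> class name. */
  static Map<String, String> shortNameToClass(String propsText) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(propsText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    Map<String, String> result = new TreeMap<>();
    for (String className : props.stringPropertyNames()) {
      String shortName = splitValue(props.getProperty(className))[0];
      result.put(shortName, className);
    }
    return result;
  }

  public static void main(String[] args) {
    String text =
        "org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering\n"
            + "org.apache.mahout.text.SparseVectorsFromSequenceFiles = seq2sparse: Sparse Vector generation\n";
    for (Map.Entry<String, String> e : shortNameToClass(text).entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}
```

Note that each entry has to sit on a single line in the real file for
java.util.Properties to read it this way.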

  *) there are also a bunch of files in conf/ which are named
<shortName>.props, where <shortName> is one of the short-names from
driver.classes.props above.  These files are mostly empty now (well,
commented out), but for example, conf/svd.props is currently:

#i|input =
#o|output =
#nr|numRows =
#nc|numCols =
#r|rank =
#t|tempDir =

The format of these props files is that the key is of the form
"singleDashCmdLineOpt|doubleDashCmdLineOpt" (if there is no "|" in the key,
the short and long forms are assumed to be the same), and the value is
whatever you want that option to be.  Options which take no value are not
currently supported (this is a TODO).  So for example, if you had a
command line such as:

> $MAHOUT_HOME/bin/mahout svd --input /path/to/input -o /path/to/output -nr <numRows> --numCols <numCols> -r <rank> -t /tmp/svd

You could just uncomment the lines in conf/svd.props and fill them in as

i|input = /path/to/input
o|output = /path/to/output
nr|numRows = <numRows>
nc|numCols = <numCols>
r|rank = <rank>
t|tempDir = /tmp/svd

and run as

> $MAHOUT_HOME/bin/mahout svd

If you wanted to run a second time, but you didn't want to overwrite your
old results, you could then do

> $MAHOUT_HOME/bin/mahout svd -o /path/to/newOutput

which would override /path/to/output and instead use /path/to/newOutput,
with all the other properties coming from the svd.props.
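
The override behaviour above can be sketched roughly as follows.  This is
an illustration under stated assumptions, not the committed code: the class
DefaultArgsSketch and its merge() method are hypothetical, the result is
keyed by the long option name purely for convenience, and every option is
assumed to take exactly one value (matching the TODO mentioned earlier).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: DefaultArgsSketch is a made-up name, not MahoutDriver.
final class DefaultArgsSketch {

  /**
   * Merges props-file defaults (keys like "o|output") with command-line
   * arguments (like "-o", "/path"); the command line wins.
   * Returns a map of long option name -> value.
   */
  static Map<String, String> merge(Map<String, String> defaults, String[] args) {
    Map<String, String> shortToLong = new LinkedHashMap<>();
    Map<String, String> merged = new LinkedHashMap<>();

    for (Map.Entry<String, String> e : defaults.entrySet()) {
      String key = e.getKey().trim();
      int bar = key.indexOf('|');
      // With no "|" in the key, short and long forms are the same.
      String shortOpt = bar < 0 ? key : key.substring(0, bar);
      String longOpt = bar < 0 ? key : key.substring(bar + 1);
      shortToLong.put(shortOpt, longOpt);
      merged.put(longOpt, e.getValue().trim());
    }

    // Assumption: arguments come in (option, value) pairs.
    for (int i = 0; i < args.length; i += 2) {
      if (i + 1 >= args.length) {
        throw new IllegalArgumentException("Missing value for " + args[i]);
      }
      String opt = args[i];
      String name;
      if (opt.startsWith("--")) {
        name = opt.substring(2);
      } else if (opt.startsWith("-")) {
        String shortOpt = opt.substring(1);
        name = shortToLong.getOrDefault(shortOpt, shortOpt);
      } else {
        throw new IllegalArgumentException("Expected an option but got " + opt);
      }
      merged.put(name, args[i + 1]); // command line overrides the default
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> defaults = new LinkedHashMap<>();
    defaults.put("i|input", "/path/to/input");
    defaults.put("o|output", "/path/to/output");
    System.out.println(merge(defaults, new String[] {"-o", "/path/to/newOutput"}));
  }
}
```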

  *) the $MAHOUT_HOME/conf directory is just a template - the mahout shell
script adds $MAHOUT_CONF_DIR to the classpath (or $MAHOUT_HOME/conf if
$MAHOUT_CONF_DIR is not defined), and MahoutDriver reads the properties
files from the classpath.

  *) running on Hadoop:  if your $HADOOP_HOME and $HADOOP_CONF_DIR are set,
the mahout shell script automatically launches your requested main method on
your Hadoop cluster; otherwise it's run locally.

  *) if your main() isn't defined in driver.classes.props, that's ok,
it'll still run via:


> $MAHOUT_HOME/bin/mahout org.apache.mahout.blah.blah.SomeOtherDriver [remaining args]

and in fact, if you put "org.apache.mahout.blah.blah.SomeOtherDriver.props"
on your classpath, and it has the format described above for
<shortName>.props, it will be used for default properties for this class.
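
A rough sketch of that lookup, assuming the first command-line argument is
either a short-name or a fully-qualified class name, and that the defaults
file is always "<first argument>.props" on the classpath.  The class
PropsLookupSketch and its methods are hypothetical names for illustration,
not what MahoutDriver actually calls them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;

// Illustrative only: PropsLookupSketch is a made-up name, not MahoutDriver.
final class PropsLookupSketch {

  /** Short-names map to their class; anything else is taken as a class name. */
  static String resolveClassName(Map<String, String> shortNameToClass, String firstArg) {
    return shortNameToClass.getOrDefault(firstArg, firstArg);
  }

  /** Name of the classpath resource holding the defaults for this argument. */
  static String propsResourceName(String firstArg) {
    return firstArg + ".props";
  }

  /** Loads defaults from the classpath; returns empty Properties if absent. */
  static Properties loadDefaults(String firstArg) {
    Properties props = new Properties();
    ClassLoader loader = PropsLookupSketch.class.getClassLoader();
    try (InputStream in = loader.getResourceAsStream(propsResourceName(firstArg))) {
      if (in != null) {
        props.load(in);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  public static void main(String[] args) {
    String firstArg = args.length > 0 ? args[0] : "svd";
    System.out.println("defaults file: " + propsResourceName(firstArg));
    System.out.println("defaults found: " + loadDefaults(firstArg).size());
  }
}
```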

--------

I'll put this up in some nicer form for the wiki in the next couple of days.


Try out various driver classes that you use - we all use different ones, so
getting some dev/user manual test coverage would be nice, because it's kinda
tricky to unit test shell scripts and command line args and env variables
(and running on a real cluster, etc...).  We should try to fix any bugs
before release.

Feedback welcome.  It's hacky, but it adds some useful functionality, and we
can clean up the props-file syntax (or ditch it for xml/yaml/json/whatever)
as needed later.

  -jake
