Hey all,

Just an update on the new-and-improved command-line "UI" we have now. After a ton of iterations back and forth with Drew (thanks!), MAHOUT-301 has been committed. It makes it easy to trim down the long command lines for most of our *Driver main() methods: you save your default command-line arguments for each driver in a properties file, and can still override any of them on the command line. This works both locally and on Hadoop.

The feature set is as follows (usage after that):

Either from the binary distribution or from source (after having done "mvn install", naturally), this is the setup. There are a bunch of properties files with a kludgey format (I didn't want to dig into the xml rathole, and while a flexible schema would be nice, I opted to follow the YAGNI principle):

*) There is a new directory "conf" at the top level (of the binary dist, as well as source), which contains a bunch of *.props files. One special one, driver.classes.props, maps the fully-qualified name of a class which has a main() method (the keys) to its "short-name" and a brief description (the values). The current file is just the following:

###
org.apache.mahout.utils.vectors.VectorDumper = vectordump : Dump vectors from a sequence file to text
org.apache.mahout.utils.clustering.ClusterDumper = clusterdump : Dump cluster output to text
org.apache.mahout.utils.SequenceFileDumper = seqdumper : Generic Sequence File dumper
org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering
org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver = fkmeans : Fuzzy K-means clustering
org.apache.mahout.clustering.lda.LDADriver = lda : Latent Dirichlet Allocation
org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver = fpg : Frequent Pattern Growth
org.apache.mahout.clustering.dirichlet.DirichletDriver = dirichlet : Dirichlet Clustering
org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver = meanshift : Mean Shift clustering
org.apache.mahout.clustering.canopy.CanopyDriver = canopy : Canopy clustering
org.apache.mahout.utils.vectors.lucene.Driver = lucene.vector : Generate Vectors from a Lucene index
org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory
org.apache.mahout.text.SparseVectorsFromSequenceFiles = seq2sparse : Sparse Vector generation from Text sequence files
org.apache.mahout.text.WikipediaToSequenceFile = seqwiki : Wikipedia xml dump to sequence file
org.apache.mahout.classifier.bayes.TestClassifier = testclassifier : Test Bayes Classifier
org.apache.mahout.classifier.bayes.TrainClassifier = trainclassifier : Train Bayes Classifier
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd : Lanczos Singular Value Decomposition
org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob = cleansvd : Cleanup and verification of SVD output
###

It's meant to be read into java.util.Properties, where the values on the right-hand side are further split by ":" into the short-name (to be used on the command line) and the description (printed to stdout if an invalid input is given or "-h" is used with no class name to run).

*If there are missing classes from this list, please add them!*

*) There are also a bunch of files in conf/ which are named <shortName>.props, where <shortName> is one of the short-names from driver.classes.props above. These files are mostly empty now (well, commented out), but for example, conf/svd.props is currently:

#i|input =
#o|output =
#nr|numRows =
#nc|numCols =
#r|rank =
#t|tempDir =

The format of these props files is that the key is of the form "singleDashCmdLineOpt|doubleDashCmdLineOpt" (if there is no "|" in the key, the short and long forms are assumed to be the same), and the value is whatever you want that option to be. Options with no value are not currently supported; this is a TODO.

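To make the two formats a bit more concrete, here is a rough sketch of how such files could be read with java.util.Properties. This is not the actual MahoutDriver code: the class name PropsFormatSketch, the helpers parseDriverValue and parseOptionKey, and the sample value 1000 are all made up for illustration, and splitting at the first ":" is an assumption about how the value is divided.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Illustrative sketch only: PropsFormatSketch and its helpers are hypothetical
// names, not Mahout's actual MahoutDriver implementation.
public class PropsFormatSketch {

    // Split a driver.classes.props value such as
    // "svd : Lanczos Singular Value Decomposition" into { shortName, description }.
    // Assumption: the split happens at the first ':' in the value.
    static String[] parseDriverValue(String value) {
        int colon = value.indexOf(':');
        if (colon < 0) {
            return new String[] { value.trim(), "" };
        }
        return new String[] {
            value.substring(0, colon).trim(),
            value.substring(colon + 1).trim()
        };
    }

    // Split a <shortName>.props key such as "nr|numRows" into { shortOpt, longOpt }.
    // With no '|' in the key, the short and long forms are taken to be the same.
    static String[] parseOptionKey(String key) {
        int bar = key.indexOf('|');
        if (bar < 0) {
            return new String[] { key.trim(), key.trim() };
        }
        return new String[] {
            key.substring(0, bar).trim(),
            key.substring(bar + 1).trim()
        };
    }

    public static void main(String[] args) throws IOException {
        // One entry copied from driver.classes.props, loaded from a string
        // instead of the classpath to keep the sketch self-contained.
        Properties drivers = new Properties();
        drivers.load(new StringReader(
            "org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver"
            + " = svd : Lanczos Singular Value Decomposition\n"));
        for (String className : drivers.stringPropertyNames()) {
            String[] entry = parseDriverValue(drivers.getProperty(className));
            System.out.println(entry[0] + " -> " + className + " (" + entry[1] + ")");
        }

        // Two uncommented lines in the style of conf/svd.props (1000 is a made-up value).
        Properties defaults = new Properties();
        defaults.load(new StringReader("nr|numRows = 1000\nt|tempDir = /tmp/svd\n"));
        for (String key : defaults.stringPropertyNames()) {
            String[] opt = parseOptionKey(key);
            System.out.println("-" + opt[0] + " / --" + opt[1] + " = " + defaults.getProperty(key));
        }
    }
}
```

Note that java.util.Properties treats "=", ":" and whitespace as key terminators, so the "|" in the key and the ":" inside the value both come through untouched, which is presumably why the kludgey format gets away without any escaping.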
For example, if you had a command line such as:

> $MAHOUT_HOME/bin/mahout svd --input /path/to/input -o /path/to/output -nr <numRows> --numCols <numCols> -r <rank> -t /tmp/svd

you could just uncomment the lines in conf/svd.props as

i|input = /path/to/input
o|output = /path/to/output
nr|numRows = <numRows>
nc|numCols = <numCols>
r|rank = <rank>
t|tempDir = /tmp/svd

and run as

> $MAHOUT_HOME/bin/mahout svd

If you wanted to run a second time, but didn't want to overwrite your old results, you could then do

> $MAHOUT_HOME/bin/mahout svd -o /path/to/newOutput

which would override /path/to/output and instead use /path/to/newOutput, with all the other properties coming from svd.props.

*) The $MAHOUT_HOME/conf directory is just a template: the mahout shell script adds $MAHOUT_CONF_DIR to the classpath (or $MAHOUT_HOME/conf if $MAHOUT_CONF_DIR is not defined), and MahoutDriver reads the properties files from the classpath.

*) Running on Hadoop: if your $HADOOP_HOME and $HADOOP_CONF_DIR are set, the mahout shell script automatically launches your requested main method on your Hadoop cluster; otherwise it's run locally.

*) If your main() isn't defined in driver.classes.props, that's ok, it'll still run via:

> $MAHOUT_HOME/bin/mahout org.apache.mahout.blah.blah.SomeOtherDriver [remaining args]

and in fact, if you put "org.apache.mahout.blah.blah.SomeOtherDriver.props" on your classpath, and it has the format of the <shortName>.props files described above, it will be used for default properties for this class.

--------

I'll put this up in some nicer form for the wiki in the next couple of days. Try out the various driver classes that you use. We all use different ones, so getting some dev/user manual test coverage would be nice, because it's kinda tricky to unit test shell scripts and command-line args and env variables (and running on a real cluster, etc...). We should try to fix any bugs before release. Feedback welcome.

It's hacky, but it adds some useful functionality, and we can clean up the props-file syntax (or ditch it for xml/yaml/json/whatever) as needed later.

  -jake