[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix updated MAHOUT-301:
-------------------------------

    Attachment: MAHOUT-301.patch

Fancy new version.  Run as follows:

Set $MAHOUT_CONF_DIR to the directory that holds your own overrides 
(if unset, it defaults to ./core/src/main/resources).

In that directory, there should be a file called "driver.classes.props" with 
contents like so:
{code}
org.apache.mahout.utils.vectors.VectorDumper="vecDump"
org.apache.mahout.utils.clustering.ClusterDumper="clusty"
org.apache.mahout.utils.SequenceFileDumper="seqDump"
org.apache.mahout.clustering.kmeans.KMeansDriver="kmeans"
org.apache.mahout.clustering.canopy.CanopyDriver="canopy"
org.apache.mahout.utils.vectors.lucene.Driver="luceneVecs"
org.apache.mahout.text.SequenceFilesFromDirectory="dirToSeq"
org.apache.mahout.text.WikipediaToSequenceFile="wikToSeq"
org.apache.mahout.classifier.bayes.TestClassifier="TestClassifier"
{code}

Etc.  The right hand side can be whatever you want, *but* whatever it is 
determines where MahoutDriver will look for a default properties file.  For 
example:

{code}
$MAHOUT_HOME/bin/mahout run wikToSeq
{code}

would look for the file $MAHOUT_CONF_DIR/wikToSeq.props and transform each 
line of that file into command-line arguments for WikipediaToSequenceFile.

Each line of wikToSeq.props is a key-value pair:

{code}
i | input = my/wiki/input/path
o | output = my/output/path
c | categories = my/wikiCategories/file
e | exactMatch = true
all = true
{code}

The part of the key before the vertical bar is the short name of the argument 
to pass, and the part after it is the long name.  If the key has no vertical 
bar, the short and long names are assumed to be the same.
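
As a rough illustration of the lookup and key handling described above, here is 
a minimal Java sketch.  It is not the code in the attached patch: the class and 
method names (PropsArgsSketch, propsFileFor, splitKey, toArgs) are hypothetical, 
and emitting every pair as a long-form "--name value" argument is an assumption 
based on the javadoc in the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the MahoutDriver implementation from the patch.
final class PropsArgsSketch {

    /** Default props file for a short name, e.g. "wikToSeq" gives "<confDir>/wikToSeq.props". */
    public static String propsFileFor(String confDir, String shortName) {
        return confDir + "/" + shortName + ".props";
    }

    /** Splits a key such as "i | input" into {shortName, longName}. */
    public static String[] splitKey(String key) {
        int bar = key.indexOf('|');
        if (bar < 0) {
            // No vertical bar: the single name serves as both short and long name.
            String name = key.trim();
            return new String[] { name, name };
        }
        return new String[] { key.substring(0, bar).trim(), key.substring(bar + 1).trim() };
    }

    /** Turns "key = value" lines into long-form arguments (assumed output shape). */
    public static List<String> toArgs(List<String> lines) {
        List<String> args = new ArrayList<String>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // assumption: lines without '=' (including blank lines) are skipped
            }
            String[] names = splitKey(line.substring(0, eq));
            args.add("--" + names[1]);
            args.add(line.substring(eq + 1).trim());
        }
        return args;
    }

    public static void main(String[] unused) {
        System.out.println(propsFileFor("conf", "wikToSeq"));
        System.out.println(toArgs(Arrays.asList(
            "i | input = my/wiki/input/path",
            "all = true")));
    }
}
```

How the real driver treats flag-style entries such as "all = true" is not 
stated here, so the sketch simply passes the value through.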

You can also pass Hadoop options here, like 
{code}
Djava.io.tmpdir = /var/tmp/mahout 
{code}

which would lead to the program being called with 
"-Djava.io.tmpdir=/var/tmp/mahout" passed in.

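To make the Hadoop-option case concrete, a small Java sketch follows.  The rule 
used to recognise such a line (a key that starts with a capital "D") is only a 
guess at what the patch does, and the names HadoopOptionSketch, isDOption and 
toDOption are made up for illustration.

```java
// Hypothetical sketch, not the MahoutDriver implementation from the patch.
final class HadoopOptionSketch {

    /** Assumption: a key that starts with "D" names a -D style property. */
    public static boolean isDOption(String key) {
        String k = key.trim();
        return k.length() > 1 && k.charAt(0) == 'D';
    }

    /** Formats one "key = value" line as a single "-Dname=value" token. */
    public static String toDOption(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("expected key = value: " + line);
        }
        String key = line.substring(0, eq).trim();
        String value = line.substring(eq + 1).trim();
        return "-" + key + "=" + value;
    }

    public static void main(String[] unused) {
        String line = "Djava.io.tmpdir = /var/tmp/mahout ";
        if (isDOption(line.substring(0, line.indexOf('=')))) {
            System.out.println(toDOption(line));
        }
    }
}
```

Unlike ordinary options, the key and value end up in one token, which is why 
they are handled separately from the "--name value" pairs.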

> Improve command-line shell script by allowing default properties files
> ----------------------------------------------------------------------
>
>                 Key: MAHOUT-301
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-301
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Utils
>    Affects Versions: 0.3
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.4
>
>         Attachments: MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch
>
>
> Snippet from javadoc gives the idea:
> {code}
> /**
>  * General-purpose driver class for Mahout programs.  Utilizes org.apache.hadoop.util.ProgramDriver to run
>  * main methods of other classes, but first loads up default properties from a properties file.
>  *
>  * Usage: run on Hadoop like so:
>  *
>  * $HADOOP_HOME/bin/hadoop jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
>  *   [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc]
>  *
>  * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed?
>  *
>  * (note: using the current shell script, this could be modified to be just
>  * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options]
>  * )
>  *
>  * Works like this: by default, the file "core/src/main/resources/driver.classes.props" is loaded, which
>  * defines a mapping between short names like "VectorDumper" and fully qualified class names.  This file may
>  * instead be overridden on the command line by having the first argument be some string of the form *classes.props.
>  *
>  * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the
>  * driver.classes.props file).  After this, if the next argument ends in ".props" / ".properties", it is taken to
>  * be the file to use as the default properties file for this execution, and key-value pairs are built up from that:
>  * if the file contains
>  *
>  * input=/path/to/my/input
>  * output=/path/to/my/output
>  *
>  * Then the class which will be run will have its main called with
>  *
>  *   main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });
>  *
>  * After all the "default" properties are loaded from the file, any further command-line arguments are taken in,
>  * and over-ride the defaults.
>  */
> {code}
> Could be cleaned up, as it's kinda ugly with the whole "file named in 
> .props", but gives the idea.  Really helps cut down on repetitive long 
> command lines, and lets defaults be put in props files instead of being 
> locked into the code.
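
The override step in the javadoc above (defaults from the props file first, then 
any further command-line arguments on top) could look roughly like the Java 
sketch below.  It assumes every option is given in long form and takes exactly 
one value; OverrideSketch and merge are hypothetical names, not part of the 
patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the MahoutDriver implementation from the patch.
final class OverrideSketch {

    /**
     * Starts from the defaults and lets "--name value" pairs from the command
     * line replace them. Assumes each option has exactly one value.
     */
    public static List<String> merge(Map<String, String> defaults, String[] overrides) {
        // LinkedHashMap keeps the defaults in file order; replaced keys keep their slot.
        Map<String, String> merged = new LinkedHashMap<String, String>(defaults);
        for (int i = 0; i + 1 < overrides.length; i += 2) {
            String name = overrides[i].startsWith("--") ? overrides[i].substring(2) : overrides[i];
            merged.put(name, overrides[i + 1]);
        }
        List<String> args = new ArrayList<String>();
        for (Map.Entry<String, String> entry : merged.entrySet()) {
            args.add("--" + entry.getKey());
            args.add(entry.getValue());
        }
        return args;
    }

    public static void main(String[] unused) {
        Map<String, String> defaults = new LinkedHashMap<String, String>();
        defaults.put("input", "/path/to/my/input");
        defaults.put("output", "/path/to/my/output");
        System.out.println(merge(defaults, new String[] { "--output", "/some/other/output" }));
    }
}
```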
