[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837448#action_12837448
 ] 

Drew Farris commented on MAHOUT-301:
------------------------------------

{quote}
Hmm... ok. I'm a little reticent about running -core when testing, because I'm 
not really testing what the release run will be like - I like the idea of 
having a single set of dependencies (jars, not classes directories) which are 
used locally, and the .job when hitting a remote hadoop cluster. Maybe I'm just 
not familiar with the -core option and its use.
{quote}

Ahh, I see where you're coming from. So without -core, you're suggesting that 
mahout pick up the jar files in the target directories if they exist? I think 
it is fine to modify the non-core classpath to include these; they won't be 
present in the release build anyway.
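
A rough sketch of what that classpath change might look like. The function 
name and the */target/*.jar layout are my assumptions here, not what the 
current script does:

```shell
#!/bin/sh
# Hypothetical sketch: gather built jars from each module's target directory.
# The <module>/target/*.jar layout is an assumption about the build output.
collect_target_jars() {
  cp=""
  for jar in "$1"/*/target/*.jar; do
    # When nothing has been built the glob stays unexpanded, so skip it.
    [ -f "$jar" ] || continue
    cp="${cp:+$cp:}$jar"
  done
  printf '%s\n' "$cp"
}

# Example: build the classpath from MAHOUT_HOME (defaults to the current dir).
CLASSPATH="$(collect_target_jars "${MAHOUT_HOME:-.}")"
```

In a release build there are no target directories, so the function prints an 
empty string and the classpath is left to whatever the script already does.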

{quote}
The last step, as you've noted, is because I'm not sure that the script 
actually lets HADOOP_CONF_DIR get passed through the mahout shell script to 
the job actually running on the hadoop cluster, but maybe that's just a 
config issue in my case? It also means that the default properties idea 
still doesn't work on hadoop, unless the default properties files are pushed to 
the classpath.
{quote}

Are any of the default properties files used beyond the MahoutDriver, which 
executes locally and sets up the job? Do these files need to be distributed to 
the rest of the cluster? As noted above, I think the proper way to run 
MahoutDriver in the context of a distributed job is to do something like:

{code}
./bin/mahout org.apache.hadoop.util.RunJar \
  /path/to/mahout-examples-0.3-SNAPSHOT.job \
  org.apache.mahout.driver.MahoutDriver TestClassifier
{code}

I suspect we could easily modify the mahout script and shorten this to:

{code}
./bin/mahout runjob TestClassifier
{code}
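
As a minimal sketch of that shortcut, the script could rewrite its arguments 
before handing them on. MAHOUT_JOB and expand_runjob are names I made up for 
illustration; the real script would need to locate the .job file itself:

```shell
#!/bin/sh
# Hypothetical: location of the job file; the default is only a placeholder.
MAHOUT_JOB="${MAHOUT_JOB:-/path/to/mahout-examples-0.3-SNAPSHOT.job}"

# Expand "runjob <name> [args]" into the long RunJar form; pass anything
# else through unchanged. Prints the resulting argument string.
expand_runjob() {
  if [ "${1:-}" = "runjob" ]; then
    shift
    echo "org.apache.hadoop.util.RunJar $MAHOUT_JOB org.apache.mahout.driver.MahoutDriver $*"
  else
    echo "$*"
  fi
}
```

With this, "runjob TestClassifier" expands to the same RunJar arguments as the 
longer command above.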

I can look at this a little closer tonight, so if you have an updated patch for 
me to work on/test in a few hours, definitely post it. I'd be happy to make any 
changes you're interested in.

{quote}
What is the right way to run a job with some additional (runtime) files added to 
the job's classpath? Is there some cmdline arg to "hadoop" that I'm forgetting?
{quote}

FWIW, 
[GenericOptionsParser|http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/GenericOptionsParser.html]
 provides a way to do this with -files, -libjars and -archives.
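
For illustration, here is a dry-run helper that only prints the command line 
it would build. The helper name is hypothetical, and I have not checked 
whether MahoutDriver hands its arguments to GenericOptionsParser, so treat the 
placement of -files as an assumption:

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints a hadoop command line, does not run it.
# -files takes a comma-separated list and is only honoured when the main
# class parses its arguments with GenericOptionsParser (e.g. via ToolRunner).
hadoop_cmd_with_files() {
  job="$1"; main="$2"; files="$3"
  shift 3
  echo "hadoop jar $job $main -files $files $*"
}

hadoop_cmd_with_files /path/to/mahout-examples-0.3-SNAPSHOT.job \
  org.apache.mahout.driver.MahoutDriver /path/to/default.props TestClassifier
```

The generic options have to come after the main class and before the 
application's own arguments.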


> Improve command-line shell script by allowing default properties files
> ----------------------------------------------------------------------
>
>                 Key: MAHOUT-301
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-301
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Utils
>    Affects Versions: 0.3
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.4
>
>         Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, 
> MAHOUT-301.patch, MAHOUT-301.patch
>
>
> Snippet from javadoc gives the idea:
> {code}
> /**
>  * General-purpose driver class for Mahout programs.  Utilizes 
> org.apache.hadoop.util.ProgramDriver to run
>  * main methods of other classes, but first loads up default properties from 
> a properties file.
>  *
>  * Usage: run on Hadoop like so:
>  *
>  * $HADOOP_HOME/bin/hadoop jar path/to/job 
> org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
>  *   [default.props file for this class] [over-ride options, all specified in 
> long form: --input, --jarFile, etc]
>  *
>  * TODO: set the Main-Class to just be MahoutDriver, so that this option 
> isn't needed?
>  *
>  * (note: using the current shell script, this could be modified to be just 
>  * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props 
> file] [over-ride options]
>  * )
>  *
>  * Works like this: by default, the file 
> "core/src/main/resources/driver.classes.props" is loaded, which
>  * defines a mapping between short names like "VectorDumper" and fully 
> qualified class names.  This file may
>  * instead be overridden on the command line by having the first argument be 
> some string of the form *classes.props.
>  *
>  * The next argument to the Driver is supposed to be the short name of the 
> class to be run (as defined in the
>  * driver.classes.props file).  After this, if the next argument ends in 
> ".props" / ".properties", it is taken to
>  * be the file to use as the default properties file for this execution, and 
> key-value pairs are built up from that:
>  * if the file contains
>  *
>  * input=/path/to/my/input
>  * output=/path/to/my/output
>  *
>  * Then the class which will be run will have its main called with
>  *
>  *   main(new String[] { "--input", "/path/to/my/input", "--output", 
> "/path/to/my/output" });
>  *
>  * After all the "default" properties are loaded from the file, any further 
> command-line arguments are taken in,
>  * and over-ride the defaults.
>  */
> {code}
> Could be cleaned up, as it's kinda ugly with the whole "file named in 
> .props", but gives the idea.  Really helps cut down on repetitive long 
> command lines, and also lets defaults be put in props files instead of being 
> locked into the code.
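
The defaults step described in the javadoc can be sketched in shell, purely as 
an illustration. The real MahoutDriver is Java, props_to_args is a made-up 
name, and the command-line override step is not shown:

```shell
#!/bin/sh
# Illustration only: turn key=value lines from a props file into
# "--key value" arguments, skipping blank lines and # comments.
props_to_args() {
  args=""
  while IFS='=' read -r key value || [ -n "$key" ]; do
    case "$key" in ''|'#'*) continue ;; esac
    args="${args:+$args }--$key $value"
  done < "$1"
  printf '%s\n' "$args"
}
```

Given a file containing the two lines from the javadoc example, this prints 
the same --input and --output pairs that main would receive.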

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
