[off-topic] Maven and SCP deploy.

2010-02-23 Thread Dawid Weiss
There are many folks knowledgeable about maven on this list, so I
thought I'd ask -- I'm trying to write a POM with scp deployment, but
maven consistently fails for me with authentication errors -- this is
most likely caused by an outdated (and buggy) jsch dependency (0.1.38
instead of 0.1.42). Does anybody know how to override this dependency from
within the POM for wagon-ssh? I tried a dozen different
configurations, but none work.

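For concreteness, a sketch of the kind of override being attempted; the
wagon-ssh version below is an illustrative guess, not a verified fix:

{code}
<!-- pom.xml sketch: pin a newer wagon-ssh build extension so deploys
     pick up its (newer) bundled jsch; version is an illustrative guess -->
<build>
  <extensions>
    <extension>
      <groupId>org.apache.maven.wagon</groupId>
      <artifactId>wagon-ssh</artifactId>
      <version>1.0-beta-6</version>
    </extension>
  </extensions>
</build>
{code}
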
Dawid


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837156#action_12837156
 ] 

Sean Owen commented on MAHOUT-305:
--

You don't entirely drop ratings; it's that they don't figure into the
similarity metric, right? But yes, ratings are not relevant to the point I was
trying to make. Regardless, Harry Potter 3 is a better recommendation.

I really think you have to take out the highest-rated items, or this is a fairly
flawed test, for this reason. Does anyone else have experience in or thoughts
on defining precision and recall in this context? 3,4,5 is arbitrary; just pick
the top n, or top n%, I'd imagine.

 Combine both cooccurrence-based CF M/R jobs
 ---

 Key: MAHOUT-305
 URL: https://issues.apache.org/jira/browse/MAHOUT-305
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

 We have two different but essentially identical MapReduce jobs to make 
 recommendations based on item co-occurrence: 
 org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
 merged. Not sure exactly how to approach that but noting this in JIRA, per 
 Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Look! No more ISSUES

2010-02-23 Thread Robin Anil
Maybe announce a release candidate quickly. Then we might get feedback on
any bugs from users outside.

Robin

On Tue, Feb 23, 2010 at 3:10 PM, Sean Owen sro...@gmail.com wrote:

 I'm happy to play release engineer. Per
 http://cwiki.apache.org/confluence/display/MAHOUT/How+to+release :

 - I'm targeting a release for as soon as possible. Let's say this Friday.
 - Let's call a code freeze right now. No changes except:
  - Javadoc fixes and improvements
  - New unit tests
  - Bug fixes due to new tests and new issues discovered now
 - Focus on testing, examining the build, running examples

 Meanwhile I'm going to start the motions for release so we can
 discover and discuss any wrinkles in the process that we didn't see
 last time.

 Sean

 On Mon, Feb 22, 2010 at 10:34 PM, Robin Anil robin.a...@gmail.com wrote:
  waiting for 301 to get committed.
 
 
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310751&styleName=Html&version=12314281
 
  PMCs, it's in your hands now :D
 
  Robin
 



Re: data mining tool in hadoop

2010-02-23 Thread Robin Anil
Check out Apache Mahout. http://lucene.apache.org/mahout You are welcome to
contribute

--
Robin Anil
Blog: http://techdigger.wordpress.com
---

Mahout in Action - Mammoth Scale machine learning
Read Chapter 1 - It's Free
http://www.manning.com/owen

Try out Swipeball for iPhone
http://itunes.com/apps/swipeball



On Tue, Feb 23, 2010 at 4:00 PM, btp...@tce.edu wrote:

 Hi all,
  I am new to this Hadoop environment. I would like to develop a data mining
 tool for classification. Is there any data mining tool available in
 Hadoop?
 Thanks,
 Parvathi




 -
 This email was sent using TCEMail Service.
 Thiagarajar College of Engineering
 Madurai-625 015, India




[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837192#action_12837192
 ] 

Ankur commented on MAHOUT-305:
--

Just picking a random N% of the data for each user, calculating average precision
and recall across all users in the test data, and then repeating the test K times
to average across all runs should be a reasonably fair assessment, IMHO.

Mahouters, your opinion here would be valuable.

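A rough Java sketch of the split being discussed (hypothetical code, not
Mahout API): hold out a random N% of each user's prefs, repeat K times, and
average precision/recall over the runs; sorting by rating instead of
shuffling would give the top-n-rated variant suggested above.

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class EvalSplitSketch {
  // held-out test items for one user: a random fraction of their prefs
  static List<String> holdOut(List<String> prefs, double fraction, Random rnd) {
    List<String> copy = new ArrayList<String>(prefs);
    Collections.shuffle(copy, rnd);
    int n = (int) Math.round(fraction * copy.size());
    return copy.subList(0, n);
  }

  public static void main(String[] args) {
    List<String> prefs = new ArrayList<String>();
    Collections.addAll(prefs, "m1", "m2", "m3", "m4", "m5");
    Random rnd = new Random(42);
    for (int run = 0; run < 3; run++) { // K runs, averaged afterwards
      System.out.println("run " + run + " test set: " + holdOut(prefs, 0.4, rnd));
    }
  }
}
{code}
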



Re: Look! No more ISSUES

2010-02-23 Thread Isabel Drost
On Tue Sean Owen sro...@gmail.com wrote:
 I'm happy to play release engineer.

Great - Thanks, Sean.

Isabel


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837193#action_12837193
 ] 

Sean Owen commented on MAHOUT-305:
--

I just don't think it can be random. It's like doing a precision/recall test on
search results and defining the relevant documents as some randomly-chosen subset
of all documents.




Re: Look! No more ISSUES

2010-02-23 Thread Sean Owen
Any thoughts on the highlights we call out for 0.3 on the web site? Is
this anything like right?

<li>New math, collections modules</li>
<li>LLR collocation implementation</li>
<li>FP-bonsai implementation</li>
<li>Hadoop-based Lanczos SVD solver</li>
<li>Shell scripts for easier running of algorithms, examples</li>
<li>... and much much more: code cleanup, many bug fixes
and performance improvements</li>

On Tue, Feb 23, 2010 at 11:02 AM, Isabel Drost isa...@apache.org wrote:
 On Tue Sean Owen sro...@gmail.com wrote:
 I'm happy to play release engineer.

 Great - Thanks, Sean.

 Isabel



Re: Look! No more ISSUES

2010-02-23 Thread Robin Anil
This was the format last time:


New Mahout 0.2 features include

   - Major performance enhancements in Collaborative Filtering,
   Classification and Clustering
   - New: Latent Dirichlet Allocation (LDA) implementation for topic
   modelling
   - New: Frequent Itemset Mining for mining top-k patterns from a list of
   transactions
   - New: Decision Forests implementation for Decision Tree classification
   (In Memory & Partial Data)
   - New: HBase storage support for Naive Bayes model building and
   classification
   - New: Generation of vectors from Text documents for use with Mahout
   Algorithms
   - Performance improvements in various Vector implementations
   - Tons of bug fixes and code cleanup


On Tue, Feb 23, 2010 at 4:45 PM, Sean Owen sro...@gmail.com wrote:

 Any thoughts on the highlights we call out for 0.3 on the web site? Is
 this anything like right?

<li>New math, collections modules</li>
<li>LLR collocation implementation</li>
<li>FP-bonsai implementation</li>
<li>Hadoop-based Lanczos SVD solver</li>
<li>Shell scripts for easier running of algorithms,
 examples</li>
<li>... and much much more: code cleanup, many bug fixes
 and performance improvements</li>

 On Tue, Feb 23, 2010 at 11:02 AM, Isabel Drost isa...@apache.org wrote:
  On Tue Sean Owen sro...@gmail.com wrote:
  I'm happy to play release engineer.
 
  Great - Thanks, Sean.
 
  Isabel
 



Re: Look! No more ISSUES

2010-02-23 Thread Sean Owen
Format for what file? I'm editing the site's index.xml file here.

On Tue, Feb 23, 2010 at 11:18 AM, Robin Anil robin.a...@gmail.com wrote:
 This was the format last time:


 New Mahout 0.2 features include



Re: Look! No more ISSUES

2010-02-23 Thread Robin Anil
There was a thread that went out for release announcements last time. Maybe
we can start one for this one.

Robin


On Tue, Feb 23, 2010 at 4:52 PM, Sean Owen sro...@gmail.com wrote:

 Format for what file? I'm editing the site's index.xml file here.

 On Tue, Feb 23, 2010 at 11:18 AM, Robin Anil robin.a...@gmail.com wrote:
  This was the format last time:
 
 
  New Mahout 0.2 features include
 



[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837198#action_12837198
 ] 

Ankur commented on MAHOUT-305:
--

I am not proposing that we choose a random subset over all movies.  Rather, choose
a random N% of movie ratings from EACH user and use them as test data to get
precision/recall across this test set.  Also repeat this procedure X times to
get a fair assessment. They seem to do it the same way -
http://www2007.org/papers/paper570.pdf




[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837203#action_12837203
 ] 

Sean Owen commented on MAHOUT-305:
--

Yes, I understand that, and it still doesn't change the issue here.

The paper here deals with a data set with no ratings; there, picking any item as
test data is as good as picking the next. This isn't the case when we have
ratings, and we do.




0.3 release issues

2010-02-23 Thread Sean Owen
OK, first roadblock -- we can't depend on the Hadoop 0.20.2 snapshot.
It might not be such a sin to depend on 0.20.1. I believe it will
break the CF job in some instances, but this is not going to affect
existing examples or unit tests (though it should :( ).


Re: 0.3 release issues

2010-02-23 Thread Robin Anil
No issues with 0.20.1 for vectorizer, classifier, fpgrowth and clustering.

What about lanczos and cf?


On Tue, Feb 23, 2010 at 5:09 PM, Sean Owen sro...@gmail.com wrote:

 OK, first roadblock -- we can't depend on the Hadoop 0.20.2 snapshot.
 It might not be such a sin to depend on 0.20.1. I believe it will
 break the CF job in some instances, but this is not going to affect
 existing examples or unit tests (though it should :( ).



[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837205#action_12837205
 ] 

Ankur commented on MAHOUT-305:
--

Well, not factoring ratings into the similarity metric but having them influence
the train/test data for evaluation doesn't sound fair to me. So I don't think
the two of us agree on the evaluation methodology.




[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837209#action_12837209
 ] 

Sean Owen commented on MAHOUT-305:
--

They don't influence the similarity metric, but they do influence the estimated
ratings and therefore the recommendations. Are we talking about the same algorithm?
The last step is to multiply the co-occurrence matrix by the user -rating-
vector.

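To illustrate that last step with dummy data (a plain-Java sketch, not the
Mahout code): with items A, B, C, the scores are just the matrix-vector
product.

{code}
public class CooccurrenceDemo {
  public static void main(String[] args) {
    // co-occurrence counts among items A, B, C
    int[][] cooc = {
        {0, 2, 1},  // A: co-occurs twice with B, once with C
        {2, 0, 3},  // B
        {1, 3, 0}   // C
    };
    double[] ratings = {4.0, 0.0, 5.0}; // user rated A=4, C=5; B unrated
    double[] scores = new double[3];
    for (int i = 0; i < 3; i++) {
      for (int j = 0; j < 3; j++) {
        scores[i] += cooc[i][j] * ratings[j];
      }
    }
    // scores = {5.0, 23.0, 4.0}: unrated B is the top recommendation.
    // With all ratings set to 1 this reduces to plain co-occurrence sums.
    for (int i = 0; i < 3; i++) {
      System.out.println("item " + (char) ('A' + i) + ": " + scores[i]);
    }
  }
}
{code}
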



Re: 0.3 release issues

2010-02-23 Thread Grant Ingersoll
We can publish 0.20.2 on our site.  It's pretty easy to do.

On Feb 23, 2010, at 6:39 AM, Sean Owen wrote:

 OK, first roadblock -- we can't depend on the Hadoop 0.20.2 snapshot.
 It might not be such a sin to depend on 0.20.1. I believe it will
 break the CF job in some instances, but this is not going to affect
 existing examples or unit tests (though it should :( ).



Re: 0.3 release issues

2010-02-23 Thread Sean Owen
Oh, I remember why: 0.20.1 doesn't exist in Maven.
We had our own version of Hadoop 0.20.1, but when I run with that I
get problems with it not finding Commons-Codec.

I understand why that is, and we could fix that too, but yes, Grant, it would
be better, methinks, to publish our own copy of 0.20.2 for now and see
how that flies.

Er, how do we do that? Is it something you can describe, and I can document and do?

On Tue, Feb 23, 2010 at 11:41 AM, Robin Anil robin.a...@gmail.com wrote:
 No issues with 0.20.1 for vectorizer, classifier, fpgrowth and clustering.

 What about lanczos and cf?


 On Tue, Feb 23, 2010 at 5:09 PM, Sean Owen sro...@gmail.com wrote:

 OK, first roadblock -- we can't depend on the Hadoop 0.20.2 snapshot.
 It might not be such a sin to depend on 0.20.1. I believe it will
 break the CF job in some instances, but this is not going to affect
 existing examples or unit tests (though it should :( ).




Re: 0.3 release issues

2010-02-23 Thread Isabel Drost
On Tue Sean Owen sro...@gmail.com wrote:
 Er, how do we do that? Is it something you can describe, I can
 document and do?

It already has been described - and documented in our wiki:

http://cwiki.apache.org/MAHOUT/thirdpartydependencies.html

Hope that helps,
Isabel



[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ankur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837230#action_12837230
 ] 

Ankur commented on MAHOUT-305:
--

*smile* There we go.
Our last steps are essentially different. I don't do any multiplication;
instead I just join (user, movie) on 'movie' with the co-occurrence set, followed
by a group on 'user' to calculate recommendations. I guess while joining I
should multiply ratings with co-occurrence counts for better evaluation.

Can you give a small illustrative example with dummy data to describe your last
steps?




[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837234#action_12837234
 ] 

Drew Farris commented on MAHOUT-301:


bq. including the job jar is much cleaner than adding all deps. Plus there is
nothing more to configure to execute it on top of hadoop..

The job files work fine with 'hadoop jar', but putting the job files on the
classpath will not automatically include the dependencies they contain (e.g.
commons-cli2) on the classpath: the dependencies need to be added separately
(see the ClassNotFoundException case described above).

bq. BTW. How is hadoop execution done using shell script ?

If HADOOP_CONF_DIR is set, it should be picked up by the jobs, but I don't
think that means jar/job-file execution works properly. I suspect this needs
modifications to make that possible.


 Improve command-line shell script by allowing default properties files
 --

 Key: MAHOUT-301
 URL: https://issues.apache.org/jira/browse/MAHOUT-301
 Project: Mahout
  Issue Type: New Feature
  Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
 Fix For: 0.4

 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, 
 MAHOUT-301.patch, MAHOUT-301.patch


 Snippet from javadoc gives the idea:
 {code}
 /**
  * General-purpose driver class for Mahout programs.  Utilizes 
 org.apache.hadoop.util.ProgramDriver to run
  * main methods of other classes, but first loads up default properties from 
 a properties file.
  *
  * Usage: run on Hadoop like so:
  *
 * $HADOOP_HOME/bin/hadoop jar path/to/job 
 org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
  *   [default.props file for this class] [over-ride options, all specified in 
 long form: --input, --jarFile, etc]
  *
  * TODO: set the Main-Class to just be MahoutDriver, so that this option 
 isn't needed?
  *
 * (note: using the current shell script, this could be modified to be just 
  * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props 
 file] [over-ride options]
  * )
  *
  * Works like this: by default, the file 
 core/src/main/resources/driver.classes.prop is loaded, which
  * defines a mapping between short names like VectorDumper and fully 
 qualified class names.  This file may
  * instead be overridden on the command line by having the first argument be 
 some string of the form *classes.props.
  *
  * The next argument to the Driver is supposed to be the short name of the 
 class to be run (as defined in the
  * driver.classes.props file).  After this, if the next argument ends in 
 .props / .properties, it is taken to
  * be the file to use as the default properties file for this execution, and 
 key-value pairs are built up from that:
  * if the file contains
  *
  * input=/path/to/my/input
  * output=/path/to/my/output
  *
 * Then the class which will be run will have its main called with
 *
 *   main(new String[] { "--input", "/path/to/my/input", "--output",
 "/path/to/my/output" });
  *
  * After all the default properties are loaded from the file, any further 
 command-line arguments are taken in,
  * and over-ride the defaults.
  */
 {code}
 Could be cleaned up, as it's kinda ugly with the whole file named in 
 .props, but gives the idea.  Really helps cut down on repetitive long 
 command lines, lets defaults be put in props files instead of locked into the 
 code also.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: 0.3 release issues

2010-02-23 Thread Drew Farris
For hadoop, we probably should consider crafting a pom.xml file based
on the SNAPSHOT's pom.xml so that hadoop dependencies will be included
and we will not have to separately add them to Mahout. This also
promotes re-use of the jars we do deploy by third parties (if the
dependencies are correct).

The hadoop 0.20.2-SNAPSHOT pom is at:
https://repository.apache.org/service/local/repositories/snapshots/content/org/apache/hadoop/hadoop-core/0.20.2-SNAPSHOT/hadoop-core-0.20.2-20100112.125701-4.pom

We'd need to modify this to change the version and package name. It
probably would not be a bad idea to have a date/timestamp or svn
revision number from which the build was pulled somewhere in the
version, e.g. 0.20.2-r87654, so others can easily track down the
sources it was built from.

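A sketch of what the reworked pom header might look like, using the
coordinates from the deploy command below (the revision suffix is
illustrative; the dependencies section would be carried over from the
SNAPSHOT pom):

{code}
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.mahout.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.2-r87654</version>
  <packaging>jar</packaging>
  <!-- dependencies copied from the hadoop-core SNAPSHOT pom go here -->
</project>
{code}
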
The command-line to deploy using a pom would be:

mvn gpg:sign-and-deploy-file \
  -Durl=https://repository.apache.org/service/local/staging/deploy/maven2 \
  -DrepositoryId=apache.releases.https \
  -DgroupId=org.apache.mahout.hadoop -DartifactId=hadoop-core \
  -Dversion=0.20.2 -Dpackaging=jar -Dfile=hadoop-core-0.20.2.jar \
  -DpomFile=hadoop-core-0.20.2.pom

If someone doesn't get to it in the next 12 hours, I can take a look
at it this evening in EST, perhaps sooner.

The other option would be to wait until a 0.20.2 release is available,
which could be imminent. Last I saw on the list they were on rc4?

Drew

On Tue, Feb 23, 2010 at 7:23 AM, Isabel Drost isa...@apache.org wrote:
 On Tue Sean Owen sro...@gmail.com wrote:
 Er, how do we do that? Is it something you can describe, I can
 document and do?

 It already has been described - and documented in our wiki:

 http://cwiki.apache.org/MAHOUT/thirdpartydependencies.html

 Hope that helps,
 Isabel




[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837243#action_12837243
 ] 

Drew Farris commented on MAHOUT-301:


bq. BTW. How is hadoop execution done using shell script ? i.e

It looks like something like the following would do the trick:

{code}
/bin/mahout -core org.apache.hadoop.util.RunJar 
/path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver 
TestClassifier
{code}

We could probably provide a 'runjob' case that appends
'org.apache.hadoop.util.RunJar examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.driver.MahoutDriver', but perhaps this could be used in every
case that 'run' is called?

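A hypothetical sketch of that 'runjob' case for bin/mahout ($JAVA, $CLASSPATH
and the job path are assumptions about the script's internals):

{code}
# hypothetical 'runjob' case; variable names assumed, not from the script
runjob)
  shift
  exec "$JAVA" -cp "$CLASSPATH" org.apache.hadoop.util.RunJar \
    "$MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job" \
    org.apache.mahout.driver.MahoutDriver "$@"
  ;;
{code}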





[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837245#action_12837245
 ] 

Sean Owen commented on MAHOUT-305:
--

Yeah, it's just a small generalization -- what I'm up to reduces to the same
thing if all ratings are 1. You can make it run faster if there are no ratings
involved, I'm sure.

I'm going to send you a draft of chapter 6 of MiA, which has a complete writeup
on this.




Re: 0.3 release issues

2010-02-23 Thread Grant Ingersoll

On Feb 23, 2010, at 8:47 AM, Drew Farris wrote:

 
 The other option would be to wait until a 0.20.2 release is available,
 which could be imminent. Last I saw on the list they were on rc4?

This doesn't seem horribly bad.  We should download and try the RC and provide 
feedback.

Re: 0.3 release issues

2010-02-23 Thread Sean Owen
It does look imminent. As much as I don't like holding out longer, and
indefinitely, for this release, somehow I'd also really like to link
to the latest/greatest and official Hadoop release.

Let's try to be good about sticking to the code freeze -- good chance
to focus on polish -- and if 0.20.2 isn't out by end of week, revisit
this.


Re: 0.3 release issues

2010-02-23 Thread Grant Ingersoll

On Feb 23, 2010, at 9:18 AM, Sean Owen wrote:

 It does look imminent. As much as I don't like holding out longer, and
 indefinitely, for this release, somehow I'd also really like to link
 to the latest/greatest and official Hadoop release.
 
 Let's try to be good about sticking to the code freeze -- good chance
 to focus on polish -- and if 0.20.2 isn't out by end of week, revisit
 this.

+1.   We might as well upgrade to the RC, too, by adding it as a dependency.


SVD for dummies

2010-02-23 Thread Grant Ingersoll
Hey Jake,

Was just going to ask for more insight into SVD when lo and behold, I checked 
my commits mail and saw 
http://cwiki.apache.org/confluence/display/MAHOUT/DimensionalReduction.  Very 
nice!

Thank you!

-Grant






Re: SVD for dummies

2010-02-23 Thread Robin Anil
--
Robin Anil
Blog: http://techdigger.wordpress.com
---

Mahout in Action - Mammoth Scale machine learning
Read Chapter 1 - It's Free
http://www.manning.com/owen

Try out Swipeball for iPhone
http://itunes.com/apps/swipeball



On Tue, Feb 23, 2010 at 8:23 PM, Grant Ingersoll gsing...@apache.orgwrote:

 Hey Jake,

 Was just going to ask for more insight into SVD when lo and behold, I
 checked my commits mail and saw
 http://cwiki.apache.org/confluence/display/MAHOUT/DimensionalReduction.
  Very nice!

 Thank you!

 -Grant







Re: SVD for dummies

2010-02-23 Thread Robin Anil
Sorry about that. My send button got hit, and I forgot to turn undo on :(

What I was about to say was:

It's a great library Jake has given us. I can't wait to test it by using the
reduced vectors in clustering.


Robin


Re: [off-topic] Maven and SCP deploy.

2010-02-23 Thread Grant Ingersoll
Hmm, what version are you on?  I've done it successfully, but it usually 
requires some setup in your ~/.m2/settings.xml file to incorporate your public 
key, etc.  I think Mahout has it configured.  Check the How To Release page on 
the Wiki.

On Feb 23, 2010, at 3:16 AM, Dawid Weiss wrote:

 There are many folks knowledgeable about maven on this list, so I
 thought I'd ask -- I'm trying to write a POM with scp deployment, but
 maven consistently fails for me with authentication errors -- this is
 most likely caused by an outdated (and buggy) jsch dependency (0.1.38
 instead of 0.1.42). Does anybody know how to override this dependency from
 within the POM for wagon-ssh? I tried a dozen different
 configurations, but none work.
 
 Dawid




Re: 0.3 release issues

2010-02-23 Thread Isabel Drost
On Tue Grant Ingersoll gsing...@apache.org wrote:
 On Feb 23, 2010, at 9:18 AM, Sean Owen wrote:
 
  It does look imminent. As much as I don't like holding out longer,
  and indefinitely, for this release, somehow I'd also really like to
  link to the latest/greatest and official Hadoop release.
  
  Let's try to be good about sticking to the code freeze -- good
  chance to focus on polish -- and if 0.20.2 isn't out by end of
  week, revisit this.
 
 +1.   We might as well upgrade to the RC, too, by adding it as a
 dependency.

+1 (to both proposals)

Isabel


[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

2010-02-23 Thread Danny Leshem (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837310#action_12837310
 ] 

Danny Leshem commented on MAHOUT-180:
-

While testing the new code, I encountered the following issue:

...
10/02/23 18:11:17 INFO lanczos.LanczosSolver: LanczosSolver finished.
10/02/23 18:11:17 ERROR decomposer.EigenVerificationJob: Unexpected --input 
while processing Options
Usage:
 [--eigenInput <eigenInput> --corpusInput <corpusInput> --help --output
<output> --inMemory <inMemory> --maxError <maxError> --minEigenvalue
<minEigenvalue>]
Options
...

The problem seems to be in DistributedLanczosSolver.java [73]:
EigenVerificationJob expects the parameters' names to be --eigenInput and 
--corpusInput, but you're mistakenly passing them as --input and --output.

Other than this minor issue, the code seems to be working fine and indeed 
produces the right number of dense (eigen?) vectors.

 port Hadoop-ified Lanczos SVD implementation from decomposer
 

 Key: MAHOUT-180
 URL: https://issues.apache.org/jira/browse/MAHOUT-180
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.2
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
 Fix For: 0.3

 Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, 
 MAHOUT-180.patch, MAHOUT-180.patch


 I wrote up a hadoop version of the Lanczos algorithm for performing SVD on 
 sparse matrices available at http://decomposer.googlecode.com/, which is 
 Apache-licensed, and I'm willing to donate it.  I'll have to port over the 
 implementation to use Mahout vectors, or else add in these vectors as well.
 Current issues with the decomposer implementation include: if your matrix is 
 really big, you need to re-normalize before decomposition: find the largest 
 eigenvalue first, and divide all your rows by that value, then decompose, or 
 else you'll blow over Double.MAX_VALUE once you've run too many iterations 
 (the L^2 norm of intermediate vectors grows roughly as 
 (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on 
 the lower end is better than blowing over MAX_VALUE).  When this is ported to 
 Mahout, we should add in the capability to do this automatically (run a 
 couple iterations to find the largest eigenvalue, save that, then iterate 
 while scaling vectors by 1/max_eigenvalue).

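(As a toy numeric sketch of that rescaling, in plain Java rather than the
decomposer code:)

{code}
// scale a row by 1/lambdaMax so iterates stay far below Double.MAX_VALUE;
// lambdaMax would come from a few preliminary iterations
double lambdaMax = 1e150;
double[] row = {2e148, -3e149, 5e147};
for (int i = 0; i < row.length; i++) {
  row[i] /= lambdaMax; // losing precision at the low end is acceptable
}
{code}
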
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

2010-02-23 Thread Danny Leshem (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837310#action_12837310
 ] 

Danny Leshem edited comment on MAHOUT-180 at 2/23/10 4:59 PM:
--

While testing the new code, I encountered the following issue:

...
10/02/23 18:11:17 INFO lanczos.LanczosSolver: LanczosSolver finished.
10/02/23 18:11:17 ERROR decomposer.EigenVerificationJob: Unexpected --input 
while processing Options
Usage:
 [--eigenInput <eigenInput> --corpusInput <corpusInput> --help --output
<output> --inMemory <inMemory> --maxError <maxError> --minEigenvalue
<minEigenvalue>]
Options
...

The problem seems to be in DistributedLanczosSolver.java [73]:
EigenVerificationJob expects the parameters' names to be eigenInput and 
corpusInput, but you're mistakenly passing them as input and output.

Other than this minor issue, the code seems to be working fine and indeed 
produces the right number of dense (eigen?) vectors.




[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

2010-02-23 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837324#action_12837324
 ] 

Jake Mannix commented on MAHOUT-180:


Hi Danny, thanks for trying this out!  

You have indeed found some testing code which snuck in - I was trying to add
the EigenVerificationJob to the final step of Lanczos, to save people the
trouble of having to clean their eigenvectors at the end of the day, but I
didn't finish it, and yet it got checked in.

The clue in the code is that I still have a line:
{code}
 // TODO ack!
{code}
which should be a hint that I should not have checked that file in just yet. :)

I've removed it now - svn up and try again!  

If you want to see what your eigen-spectrum is like, after you've run the 
DistributedLanczosSolver, the EigenVerificationJob can be run next (it cleans 
out eigenvectors with too high error or too low eigenvalue):

{code}
$HADOOP_HOME/bin/hadoop jar 
$MAHOUT_HOME/examples/target/mahout-examples-{version}.job 
org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob \
--eigenInput path/for/svd-output --corpusInput path/to/corpus --output 
path/for/cleanOutput --maxError 0.1 --minEigenvalue 10.0 
{code}

Thanks for the bug report!




Re: 0.3 release issues

2010-02-23 Thread Ted Dunning
+1 to code freeze, waiting for hadoop release and testing the RC

On Tue, Feb 23, 2010 at 8:38 AM, Isabel Drost isa...@apache.org wrote:

 On Tue Grant Ingersoll gsing...@apache.org wrote:
  On Feb 23, 2010, at 9:18 AM, Sean Owen wrote:
 
   It does look imminent. As much as I don't like holding out longer,
   and indefinitely, for this release, somehow I'd also really like to
   link to the latest/greatest and official Hadoop release.
  
   Let's try to be good about sticking to the code freeze -- good
   chance to focus on polish -- and if 0.20.2 isn't out by end of
   week, revisit this.
 
  +1.   We might as well upgrade to the RC, too, by adding it as a
  dependency.

 +1 (to both proposals)

 Isabel




-- 
Ted Dunning, CTO
DeepDyve


Re: 0.3 release issues

2010-02-23 Thread Jake Mannix
So, to be an annoying voice of dissent... I'm going to keep iterating on
MAHOUT-301, targeted for 0.4, and I will keep it in patch form (not checked
in) _for now_... but if it can get its wrinkles ironed out before Hadoop gets
its act together, I really think it should get committed to 0.3.

It's minimally invasive (one java file and some config files, all additions
rather than changes, plus changes to the mahout shell script), and it has the
potential to really make repeatedly running our tools far easier than it
currently is.

  -jake

ps.  if we are really doing a code freeze, can we make a dev branch (or,
more appropriately, make a 0.3 release branch and allow continued development
on trunk)?  We don't really want to hold up on producing new stuff, do we?
More more MORE! ;)

On Tue, Feb 23, 2010 at 9:25 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 +1 to code freeze, waiting for hadoop release and testing the RC

 On Tue, Feb 23, 2010 at 8:38 AM, Isabel Drost isa...@apache.org wrote:

  On Tue Grant Ingersoll gsing...@apache.org wrote:
   On Feb 23, 2010, at 9:18 AM, Sean Owen wrote:
  
It does look imminent. As much as I don't like holding out longer,
and indefinitely, for this release, somehow I'd also really like to
link to the latest/greatest and official Hadoop release.
   
Let's try to be good about sticking to the code freeze -- good
chance to focus on polish -- and if 0.20.2 isn't out by end of
week, revisit this.
  
   +1.   We might as well upgrade to the RC, too, by adding it as a
   dependency.
 
  +1 (to both proposals)
 
  Isabel
 



 --
 Ted Dunning, CTO
 DeepDyve



Re: 0.3 release issues

2010-02-23 Thread Sean Owen
I say use your judgment. If you feel confident enough for it to be
enshrined for about 3 months in an official release, check it in. Soon
we will indeed want a proper release branch. This time it shouldn't be
more than a few days, and it's probably good discipline to force everyone
to do nothing but documentation and test improvements for that time.

On Tue, Feb 23, 2010 at 5:40 PM, Jake Mannix jake.man...@gmail.com wrote:
 So to be an annoying voice of dissent... I'm going to keep iterating on
 MAHOUT-301,
 targeted for 0.4, and I will keep it in patch form (not checked in) _for
 now_... but
 if it can get its wrinkles ironed out before Hadoop gets its act together, I
 really
 think it should get committed to 0.3.

 It's minimally invasive (one java file, and some config files, all additions
 (not
 changes), and then changes to the mahout shell script), and has the
 potential
 to really make repeatedly running our tools far easier than it currently is.

  -jake

 ps.  if we are really doing a code freeze, can we make a dev branch (or
 more appropriately, make a 0.3 release branch, and allow continued
 development
 on trunk)?  We don't really want to hold up on producing new stuff, do we?
 More more MORE! ;)



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837345#action_12837345
 ] 

Jake Mannix commented on MAHOUT-301:


Hey Drew, thanks for looking at this.  Problems you saw are probably what are 
known as bugs. :)
{quote}
Did some testing, here's a patch to clean some of these things up + a couple 
questions:
Could we load the default driver.classes.props from the classpath? If it was 
loaded that way the default would work regardless of where the mahout script is 
run from (it currently only works if ./bin/mahout is run, not ./mahout for 
example) and regardless of whether we're running from a binary release or the 
dev environment. (included in patch)
{quote}

YES!  We should indeed load from classpath.  My most recent version of this 
patch (which isn't posted, because it conflicts with yours, I'm trying to 
resolve that now) changes it so that you just supply a single directory in 
which driver.classes.props and the shortNames.props files are located.
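
For reference, a minimal sketch of the classpath load (standard JDK calls only;
the class name here is hypothetical):

{code}
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class DriverProps {
  static Properties load() throws IOException {
    // load driver.classes.props from the classpath instead of a fixed path,
    // so it works regardless of the directory the script is invoked from
    Properties props = new Properties();
    InputStream in = Thread.currentThread().getContextClassLoader()
        .getResourceAsStream("driver.classes.props");
    if (in != null) {
      try {
        props.load(in);
      } finally {
        in.close();
      }
    }
    return props;
  }
}
{code}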

{quote}
Something else I noticed is that the 'mahout' script doesn't add the classes in 
$MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in 
that it can't run anything, e.g.:

./mahout vectordump
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/commons/cli2/OptionException
Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.cli2.OptionException
(fixed in patch)
{quote}

This wasn't a problem with my patch, right?  That was an issue of the mahout 
script in trunk itself?  

{quote}
Using -core in the context of a dev build should work properly, but leaving out 
-core will cause the script to error unless run in the context of a release - 
this is the way it should work, right?
{quote}

What is the -core option for?  I've never used it, how does it work?

{quote}
Also added a help message for the 'run' argument.
{quote}

Where did you add that?

{quote}
Does executing './mahout run --help' hang for anyone else or is it something 
specific to my environment? (didn't track this one down)
{quote}

The --help option I didn't have in there; you added it. Do you know where it's
hanging?


Re: 0.3 release issues

2010-02-23 Thread Ted Dunning
What about a new follow-on JIRA so 301 can stay in the official release
notes?

On Tue, Feb 23, 2010 at 9:40 AM, Jake Mannix jake.man...@gmail.com wrote:

 So to be an annoying voice of dissent... I'm going to keep iterating on
 MAHOUT-301,
 targeted for 0.4, and I will keep it in patch form (not checked in) _for
 now_... but
 if it can get its wrinkles ironed out before Hadoop gets its act together,
 I
 really
 think it should get committed to 0.3.




-- 
Ted Dunning, CTO
DeepDyve


Re: 0.3 release issues

2010-02-23 Thread Jake Mannix
What does this mean?  You mean make a 301-continuation ticket for 0.4, and
reschedule the original 301 for 0.3?  I could do that *if* 301 is in good
shape
by the time Hadoop is ready, but I don't want to reschedule it until it is.

  -jake

On Tue, Feb 23, 2010 at 9:57 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 What about a new follow-on JIRA so 301 can stay in the official release
 notes?

 On Tue, Feb 23, 2010 at 9:40 AM, Jake Mannix jake.man...@gmail.com
 wrote:

  So to be an annoying voice of dissent... I'm going to keep iterating on
  MAHOUT-301,
  targeted for 0.4, and I will keep it in patch form (not checked in) _for
  now_... but
  if it can get its wrinkles ironed out before Hadoop gets its act
 together,
  I
  really
  think it should get committed to 0.3.
 



 --
 Ted Dunning, CTO
 DeepDyve



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837351#action_12837351
 ] 

Jake Mannix commented on MAHOUT-301:


Ok, Drew, I finally got your patch in diff mode against mine.

So you already added the ability to load via classpath, right?  If we merge
that way of thinking with what I'm currently working on (having a configurable
MAHOUT_CONF_DIR which is used for all these props files), we could have
the mahout shell script just add MAHOUT_CONF_DIR to the classpath (the way you
already have it adding the hardwired core/src/main/resources directory), and
then it would work that way.

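Something like this in the script, I'd imagine (the fallback directory is an
assumption):

{code}
# prepend the configurable conf dir so props files are found on the classpath
MAHOUT_CONF_DIR=${MAHOUT_CONF_DIR:-$MAHOUT_HOME/conf}
CLASSPATH=${MAHOUT_CONF_DIR}:${CLASSPATH}
{code}
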
New patch merging yours with mine forthcoming.




Re: 0.3 release issues

2010-02-23 Thread Grant Ingersoll
I assume he means mark 301 as fixed (when appropriate) and then open a new
ticket for follow-on work marked for 0.4.

-Grant

On Feb 23, 2010, at 1:03 PM, Jake Mannix wrote:

 What does this mean?  You mean make a 301-continuation ticket for 0.4, and
 reschedule the original 301 for 0.3?  I could do that *if* 301 is in good
 shape
 by the time Hadoop is ready, but I don't want to reschedule it until it is.
 
  -jake
 
 On Tue, Feb 23, 2010 at 9:57 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  What about a new follow-on JIRA so 301 can stay in the official release
 notes?
 
 On Tue, Feb 23, 2010 at 9:40 AM, Jake Mannix jake.man...@gmail.com
 wrote:
 
 So to be an annoying voice of dissent... I'm going to keep iterating on
 MAHOUT-301,
 targetted for 0.4, and I will keep it in patch form (not checked in) _for
 now_... but
 if it can get its wrinkles ironed out before Hadoop gets its act
 together,
 I
 really
 think it should get committed to 0.3.
 
 
 
 
 --
 Ted Dunning, CTO
 DeepDyve
 



Re: 0.3 release issues

2010-02-23 Thread Ted Dunning
Exactly.

On Tue, Feb 23, 2010 at 10:16 AM, Grant Ingersoll gsing...@apache.orgwrote:

 I assume he means mark 301 as fixed (when appropriate) and then open a new
 ticket for follow on work marked for 0.4.

 -Grant

 On Feb 23, 2010, at 1:03 PM, Jake Mannix wrote:

  What does this mean?  You mean make a 301-continuation ticket for 0.4,
 and
  reschedule the original 301 for 0.3?  I could do that *if* 301 is in good
  shape
  by the time Hadoop is ready, but I don't want to reschedule it until it
 is.
 
   -jake
 
  On Tue, Feb 23, 2010 at 9:57 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
  What about a new follow-on JIRA so 301 can stay in the official release
  notes?
 
  On Tue, Feb 23, 2010 at 9:40 AM, Jake Mannix jake.man...@gmail.com
  wrote:
 
  So to be an annoying voice of dissent... I'm going to keep iterating on
  MAHOUT-301,
  targeted for 0.4, and I will keep it in patch form (not checked in)
 _for
  now_... but
  if it can get its wrinkles ironed out before Hadoop gets its act
  together,
  I
  really
  think it should get committed to 0.3.
 
 
 
 
  --
  Ted Dunning, CTO
  DeepDyve
 




-- 
Ted Dunning, CTO
DeepDyve


Re: 0.3 release issues

2010-02-23 Thread Drew Farris
FWIW, I think we're pretty close on MAHOUT-301. I also think there may
be a couple of issues with what's in there now for bin/mahout (pre-301).
So I'd say +1 for a freeze except for MAHOUT-301 (which I guess
isn't a freeze then :-D)

+1 for switching to the hadoop rc for testing.

On Tue, Feb 23, 2010 at 12:40 PM, Jake Mannix jake.man...@gmail.com wrote:
 So to be an annoying voice of dissent... I'm going to keep iterating on
 MAHOUT-301,
 targeted for 0.4, and I will keep it in patch form (not checked in) _for
 now_... but
 if it can get its wrinkles ironed out before Hadoop gets its act together, I
 really
 think it should get committed to 0.3.

 It's minimally invasive (one java file, and some config files, all additions
 (not
 changes), and then changes to the mahout shell script), and has the
 potential
 to really make repeatedly running our tools far easier than it currently is.

  -jake

 ps.  if we are really doing a code freeze, can we make a dev branch (or
 more appropriately, make a 0.3 release branch, and allow continued
 development
 on trunk)?  We don't really want to hold up on producing new stuff, do we?
 More more MORE! ;)

 On Tue, Feb 23, 2010 at 9:25 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 +1 to code freeze, waiting for hadoop release and testing the RC

 On Tue, Feb 23, 2010 at 8:38 AM, Isabel Drost isa...@apache.org wrote:

  On Tue Grant Ingersoll gsing...@apache.org wrote:
   On Feb 23, 2010, at 9:18 AM, Sean Owen wrote:
  
It does look imminent. As much as I don't like holding out longer,
and indefinitely, for this release, somehow I'd also really like to
link to the latest/greatest and official Hadoop release.
   
Let's try to be good about sticking to the code freeze -- good
chance to focus on polish -- and if 0.20.2 isn't out by end of
week, revisit this.
  
   +1.   We might as well upgrade to the RC, too, by adding it as a
   dependency.
 
  +1 (to both proposals)
 
  Isabel
 



 --
 Ted Dunning, CTO
 DeepDyve




[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837378#action_12837378
 ] 

Ted Dunning commented on MAHOUT-305:



My own experience is that all that counts in recommendations is the probability 
of click (interest) on a set of recommendations.  As such, the best analog is 
probably precision at 10 or 20.  I don't think that recall at 10 or 20 makes 
any sense at all (with a depth-limited situation like this, you have given up 
on recall and are only looking at precision).

Ankur's suggestion about keeping the most recent 4's and 5's as test data seems 
right to me.  My only beefs are that you don't need recall@10, and the question 
of what to do with the unrated items.  Presumably a new-style algorithm could 
surface items that the user hadn't thought of, but really likes.  In practice, I think that 
counting unrated items in the results as misses isn't a big deal in the Netflix 
data.  In the real world where test data is more scarce, I would count unrated 
items as misses in off-line evaluation, but try to run as many alternatives as 
possible against live users.
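
To pin down terms, precision at N in this setting comes down to a very small computation; a minimal sketch follows, with hypothetical names rather than Mahout's evaluator API, and with unrated items counted as misses as discussed above.

{code}
import java.util.List;
import java.util.Set;

// Minimal sketch of precision@N for one user: the fraction of the top N
// recommended items that appear in the held-out relevant set (e.g. the
// user's most recent 4- and 5-star ratings). Unrated items count as misses.
final class PrecisionAtN {

  private PrecisionAtN() { }

  static double precisionAtN(List<Long> rankedRecs, Set<Long> heldOutRelevant, int n) {
    int limit = Math.min(n, rankedRecs.size());
    int hits = 0;
    for (int i = 0; i < limit; i++) {
      if (heldOutRelevant.contains(rankedRecs.get(i))) {
        hits++;
      }
    }
    return limit == 0 ? 0.0 : (double) hits / limit;
  }
}
{code}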

 

 Combine both cooccurrence-based CF M/R jobs
 ---

 Key: MAHOUT-305
 URL: https://issues.apache.org/jira/browse/MAHOUT-305
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

 We have two different but essentially identical MapReduce jobs to make 
 recommendations based on item co-occurrence: 
 org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
 merged. Not sure exactly how to approach that but noting this in JIRA, per 
 Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837376#action_12837376
 ] 

Drew Farris commented on MAHOUT-301:


{quote}
This wasn't a problem with my patch, right?  That was an issue of the mahout 
script in trunk itself?
{quote}

Yes, it was a problem with the script in trunk. I believe this was due to the 
fact that the job files were on the classpath instead of all of the dependency 
jars. Adding the job files to the classpath does not add the dependency jars 
they contain to the classpath as well. So, no, you didn't introduce this, but it 
should be fixed (and is in the patch).

{quote}
What is the -core option for?  I've never used it, how does it work?
{quote}

When you're running bin/mahout in the context of a build, the -core option 
tells it to use the build classpath instead of the classpath used for a 
binary release. This just follows the pattern established (by Doug?) in the 
hadoop and nutch launch scripts.

{quote}
Also added a help message for the 'run' argument.
{quote}

near line 72 in bin/mahout:
(this is different from the --help question I had)

{code}
  echo "  seq2sparse  generate sparse vectors from a sequence file"
  echo "  vectordump  dump vectors from a sequence file"
  echo "  run         run mahout tasks using the MahoutDriver, see: http://cwiki.apache.org/MAHOUT/mahoutdriver.html"
{code}

{quote}
So you already added the ability to load via classpath, right? If we merge that 
way of thinking with what I'm currently working on (having a configurable 
MAHOUT_CONF_DIR which is used for all these props files), we could just have 
the mahout shell script just add MAHOUT_CONF_DIR to the classpath (the way you 
already have it adding the hardwired core/src/main/resources directory) and 
then it would work that way.
{quote}

Yep, that should do it: as long as MAHOUT_CONF_DIR appears before 
src/main/resources, we should be good to go. It should be added outside of the 
section of the script that determines whether -core has been specified on the 
command line.
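
The ordering matters because the JVM serves a resource from the first classpath entry that contains it. A tiny sketch, not part of the patch, that one could use to check which copy wins:

{code}
public final class WhichProps {
  public static void main(String[] args) {
    // Whichever classpath entry lists driver.classes.props first wins, so a
    // copy in MAHOUT_CONF_DIR placed earlier shadows the bundled default.
    java.net.URL winner = Thread.currentThread().getContextClassLoader()
        .getResource("driver.classes.props");
    System.out.println("driver.classes.props resolved from: " + winner);
  }
}
{code}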



 Improve command-line shell script by allowing default properties files
 --

 Key: MAHOUT-301
 URL: https://issues.apache.org/jira/browse/MAHOUT-301
 Project: Mahout
  Issue Type: New Feature
  Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
 Fix For: 0.4

 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, 
 MAHOUT-301.patch, MAHOUT-301.patch


 Snippet from javadoc gives the idea:
 {code}
 /**
  * General-purpose driver class for Mahout programs.  Utilizes 
 org.apache.hadoop.util.ProgramDriver to run
  * main methods of other classes, but first loads up default properties from 
 a properties file.
  *
  * Usage: run on Hadoop like so:
  *
  * $HADOOP_HOME/bin/hadoop -jar path/to/job 
 org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
  *   [default.props file for this class] [over-ride options, all specified in 
 long form: --input, --jarFile, etc]
  *
  * TODO: set the Main-Class to just be MahoutDriver, so that this option 
 isn't needed?
  *
  * (note: using the current shell script, this could be modified to be just 
  * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props 
 file] [over-ride options]
  * )
  *
  * Works like this: by default, the file 
 core/src/main/resources/driver.classes.prop is loaded, which
  * defines a mapping between short names like VectorDumper and fully 
 qualified class names.  This file may
  * instead be overridden on the command line by having the first argument be 
 some string of the form *classes.props.
  *
  * The next argument to the Driver is supposed to be the short name of the 
 class to be run (as defined in the
  * driver.classes.props file).  After this, if the next argument ends in 
 .props / .properties, it is taken to
  * be the file to use as the default properties file for this execution, and 
 key-value pairs are built up from that:
  * if the file contains
  *
  * input=/path/to/my/input
  * output=/path/to/my/output
  *
  * Then the class which will be run will have its main called with
  *
  *   main(new String[] { --input, /path/to/my/input, --output, 
 /path/to/my/output });
  *
  * After all the default properties are loaded from the file, any further 
 command-line arguments are taken in,
  * and over-ride the defaults.
  */
 {code}
 Could be cleaned up, as it's kinda ugly with the whole file named in 
 .props business, but it gives the idea.  Really helps cut down on repetitive 
 long command lines, and lets defaults be put in props files instead of locked 
 into the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


RE: SVD for dummies

2010-02-23 Thread mike bowles
Here's another source of understanding about SVD: MIT Professor Gilbert
Strang's lecture on the subject.  Strang is a fine educator and gentleman,
and wonderfully clear in his explanation of the underlying geometry of
SVD.

http://videolectures.net/mit1806s05_strang_lec29/

You can also find Strang's lecture and others by searching for "singular
value decomposition" on YouTube.
Mike



-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, February 23, 2010 6:54 AM
To: Mahout Dev List
Subject: SVD for dummies

Hey Jake,

Was just going to ask for more insight into SVD when lo and behold, I
checked my commits mail and saw
http://cwiki.apache.org/confluence/display/MAHOUT/DimensionalReduction.
Very nice!

Thank you!

-Grant








Re: [off-topic] Maven and SCP deploy.

2010-02-23 Thread Dawid Weiss
 Hmm, what version are you on?  I've done it successfully, but it usually 
 requires some setup in your

Apache Maven 2.2.1 (r801777; 2009-08-06 21:16:01+0200)

I'm using a POM configuration based on an identical parent to Mahout's,
but no success. Interestingly, the default scp (built-in) works on one
server, but fails on another. I've seen tons of e-mails about this on
the internet, but apparently the problem is due to an outdated jsch...
It may be this particular server's SSH configuration, although
command-line SCP does work (and so does SSH). I've tried all
possible configurations -- scp, scpexe (external scp), private key and
password authentication. scpexe was the best of all the configurations
and worked from time to time... but not always.

Can you see what your set of plugins is, Grant? (If you issue mvn -X
... | grep jsch it'll print the jsch plugins actually used.)

I did manage to set up an ftp deployment, so this is not crucial for
me anymore, but I'd rather have scp than ftp.

Thanks,
Dawid

~/.m2/settings.xml file to incorporate your public key, etc.  I think
Mahout has it configured.  Check the How To Release page on the Wiki.

 On Feb 23, 2010, at 3:16 AM, Dawid Weiss wrote:

 There are many folks knowledgeable about maven on this list, so I
 thought I'd ask -- I'm trying to write a POM with scp deployment, but
 maven consistently fails for me with authentication errors -- this is
 most likely caused by an outdated (and buggy) jsch dependency (0.1.38
 instead of 0.1.42). Anybody knows how to override this dependency from
 within the POM for wagon-ssh? I tried a dozen different
 configurations, but none work.

 Dawid





Re: [off-topic] Maven and SCP deploy.

2010-02-23 Thread Ted Dunning
Can you over-ride that dependency?

On Tue, Feb 23, 2010 at 12:07 PM, Dawid Weiss dawid.we...@gmail.com wrote:

 but apparently the problem is due to an outdated jsch




-- 
Ted Dunning, CTO
DeepDyve


[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837428#action_12837428
 ] 

Jake Mannix commented on MAHOUT-301:


{quote}
Something else I noticed is that the 'mahout' script doesn't add the classes in 
$MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in 
that it can't run anything, e.g.:
{quote}

{quote}
Also wondering what the purpose of adding the job jars to the classpath is? 
(removed in patch)
{quote}

When I run locally now, not using -core, I get this failure:
{code}
./bin/mahout vectordump -s wiki-sparse-vectors-out/vectors/part-0
Exception in thread main java.lang.NoClassDefFoundError: 
org/apache/mahout/utils/vectors/VectorDumper
{code}

This appears to be because your patch has CLASSPATH set to add on things like 
$MAHOUT_HOME/mahout-*.jar, which doesn't exist after I've done mvn install.  
Is there another maven target I need to use to generate the release jars in 
$MAHOUT_HOME?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837434#action_12837434
 ] 

Drew Farris commented on MAHOUT-301:


Jake, the basic idea is that you would always use -core when executing from 
within a build, but you would not use -core when executing in the context of a 
binary release.

The binary release, built using mvn -Prelease, lands in 
target/mahout-0.3-SNAPSHOT.tar.gz; untar that and try running bin/mahout from 
the directory that's created, and it should work fine without -core.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837440#action_12837440
 ] 

Jake Mannix commented on MAHOUT-301:


{quote}
Jake, the basic idea is that you would always use -core when executing from 
within a build, but you would not use core when executing in the context of a 
binary release.
{quote}

Hmm... ok.  I'm a little reluctant to run -core when testing, because then I'm 
not really testing what the release run will be like - I like the idea of 
having a single set of dependencies (jars, not classes directories) which are 
used locally, and the .job when hitting a remote hadoop cluster.  Maybe I'm 
just not familiar with the -core option and its use.  

So far, I've always run by the process of 

 * make code/config changes
 * run mvn clean install (sometimes with -DskipTests if I'm doing rapid 
iterations)
 * run mahout command args OR
 * hadoop jar examples/target/mahout-examples-{version}.job classname args

The last step, as you've noted, is because I'm not sure that the mahout shell 
script properly passes HADOOP_CONF_DIR through to an actual run on the hadoop 
cluster, but maybe that's just a config issue in my case?  It also means that 
the default properties idea still doesn't work on hadoop, unless the default 
properties files are pushed to the classpath.

Maybe a kludgey way to do it would be for the script to grab the properties 
files from the MAHOUT_CONF_DIR, unzip the release job jar, push them into it, 
and re-jar it back up and then give it to hadoop, and now those files will be 
available on the classpath of the running job on the remote cluster? 

What is the right way to run a job with some additional (runtime) files added to 
the job's classpath?  Is there some cmdline arg to hadoop that I'm forgetting?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837448#action_12837448
 ] 

Drew Farris commented on MAHOUT-301:


{quote}
Hmm... ok. I'm a little reticent about running -core when testing, because I'm 
not really testing what the release run will be like - I like the idea of 
having a single set of dependencies (jars, not classes directories) which are 
used locally, and the .job when hitting a remote hadoop cluster. Maybe I'm just 
not familiar with the -core option and it's use.
{quote}

Ahh, I see where you're coming from: so without -core, you're suggesting that 
mahout pick up the jar files in the target directories if they exist? I think 
it is fine to modify the non-core classpath to include these; they won't be 
present in the release build anyway.

{quote}
The last step, as you've noted, is because I'm not sure that the script 
actually properly lets HADOOP_CONF_DIR properly get passed through the mahout 
shell script to actually running on the hadoop cluster, but maybe that's just a 
config issue in my case? Also means that in fact the default properties idea 
still doesn't work on hadoop, unless the default properties files are pushed to 
the classpath.
{quote}

Are any of the default properties files used beyond the MahoutDriver, which 
executes locally and sets up the job? Do these files need to be distributed to 
the rest of the cluster? As noted above, I think the proper way to run 
MahoutDriver in the context of a distributed job is to do something like:

{code}
./bin/mahout org.apache.hadoop.util.RunJar 
/path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver 
TestClassifier
{code}

I suspect we could easily modify the mahout script and shorten this to:

{code}
./bin/mahout runjob TestClassifier
{code}

I can look at this a little closer tonight, so if you have an updated patch for 
me to work on/test in a few hours, definitely post it. I'd be happy to make any 
changes you're interested in.

{quote}
What is the right way run a job with some additional (runtime) files added to 
the job's classpath? Is there some cmdline arg to hadoop that I'm forgetting?
{quote}

FWIW, 
[GenericOptionsParser|http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/GenericOptionsParser.html]
 provides a way to do this with -files, -libjars and -archives.
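
On the Java side, the usual way to get those generic options honored is to run the job class through ToolRunner, which applies GenericOptionsParser to the Configuration before run() is invoked. A minimal sketch, with a hypothetical job class name:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Minimal sketch: a Tool launched via ToolRunner has -files, -libjars and
// -archives parsed by GenericOptionsParser and applied to its Configuration
// before run() is invoked. The class name is hypothetical.
public class MyMahoutJob extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // generic options already applied here
    // ... set up and submit the actual MapReduce job using conf ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyMahoutJob(), args));
  }
}
{code}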



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837472#action_12837472
 ] 

Jake Mannix commented on MAHOUT-301:


{quote}
Ahh, I see where you're coming from: so without -core, you're suggesting that 
mahout pick up the jar files in the target directories if they exist? I think 
it is fine to modify the non-core classpath to include these; they won't be 
present in the release build anyway.
{quote}

Cool, yeah, that makes sense.
{quote}
Are any of the default properties files used beyond the MahoutDriver, which 
executes locally and sets up the job? Do these files need to be distributed to 
the rest of the cluster? As noted above, I think the proper way to run 
MahoutDriver in the context of a distributed job is to do something like:
{code}
./bin/mahout org.apache.hadoop.util.RunJar 
/path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver 
TestClassifier
{code}
I suspect we could easily modify the mahout script and shorten this to:
{code}
./bin/mahout runjob TestClassifier
{code}
{quote}

Cool, so why not just check to see if $HADOOP_CONF_DIR is set - if it is, do 
runjob as described; if it's not, do run to execute locally.

{quote}
FWIW, 
[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/GenericOptionsParser.html|GenericOptionsParser]
 provides a way to do this with -files, -libjars and -archives
{quote}

Now of course, I guess I don't really need the files to get onto the job's 
classpath *on the cluster* - they just need to be on the classpath of the 
locally running jvm which is invoking MahoutDriver.main().  So I was doing more 
work than was necessary.  This is easy to do: just add MAHOUT_CONF_DIR to the 
classpath and we're good to go.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837477#action_12837477
 ] 

Drew Farris commented on MAHOUT-301:


bq. Cool, so why not just check to see if $HADOOP_CONF_DIR is set - if it is, 
do runjob as described; if it's not, do run to execute locally.

Yes, ok -- that should work, because I believe you can use RunJar to launch 
anything even if it isn't a mapreduce job; no need for classpath setup in this 
case either -- all you need to do is point to the examples job. We might be 
able to take advantage of this elsewhere.
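
Since RunJar is itself just a main class, the same launch can be sanity-checked from Java as well; a sketch, with the job path as a placeholder:

{code}
import org.apache.hadoop.util.RunJar;

// Sketch: RunJar unpacks the jar/job file, builds a classloader over its
// contents (including embedded lib/ jars) and invokes the named main class,
// whether or not that class submits a MapReduce job. The path below is a
// placeholder.
public final class RunJarCheck {

  private RunJarCheck() { }

  public static void main(String[] args) throws Throwable {
    RunJar.main(new String[] {
        "/path/to/mahout-examples-0.3-SNAPSHOT.job",
        "org.apache.mahout.driver.MahoutDriver",
        "TestClassifier"
    });
  }
}
{code}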


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [off-topic] Maven and SCP deploy.

2010-02-23 Thread Ted Dunning
Barely.  This article might help:


http://unitstep.net/blog/2009/05/18/resolving-log4j-1215-dependency-problems-in-maven-using-exclusions/

On Tue, Feb 23, 2010 at 12:20 PM, Dawid Weiss dawid.we...@gmail.com wrote:

 do you know how/where to place such an
 override?




-- 
Ted Dunning, CTO
DeepDyve


[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-02-23 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837528#action_12837528
 ] 

Ted Dunning commented on MAHOUT-305:


{quote}
Yeah, in this context there's no choice but to count unrated items as misses. My 
intuition, based on limited experience, is that it is in fact an issue - are the 
best items for a user typically found among their ratings in real-world data 
sets? I just can't imagine it's so for most users, who express few ratings.
{quote}

This suggests that mean reciprocal rank (MRR) of the top 5 or 10 highly rated 
items might be a useful measure.  Even if the top 10 has several unrated good 
choices, if the rated choices all rank pretty high then you can feel pretty 
good about them even if they didn't quite make the top 10.
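
For concreteness, a minimal sketch of MRR over users, assuming a ranked recommendation list and a held-out set of highly rated items per user (hypothetical names, not Mahout API):

{code}
import java.util.List;
import java.util.Set;

// Minimal sketch of mean reciprocal rank over users: for each user, take
// 1/rank of the first held-out highly-rated item found in that user's
// ranked recommendations (0 if none appears), then average over users.
final class MrrSketch {

  private MrrSketch() { }

  static double meanReciprocalRank(List<List<Long>> rankedRecsPerUser,
                                   List<Set<Long>> heldOutPerUser) {
    double sum = 0.0;
    for (int u = 0; u < rankedRecsPerUser.size(); u++) {
      List<Long> recs = rankedRecsPerUser.get(u);
      Set<Long> heldOut = heldOutPerUser.get(u);
      for (int rank = 1; rank <= recs.size(); rank++) {
        if (heldOut.contains(recs.get(rank - 1))) {
          sum += 1.0 / rank; // first hit decides this user's contribution
          break;
        }
      }
    }
    return rankedRecsPerUser.isEmpty() ? 0.0 : sum / rankedRecsPerUser.size();
  }
}
{code}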


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix updated MAHOUT-301:
---

Attachment: MAHOUT-301.patch

Ok, new patch.

This one works in one of two ways.  If you have $MAHOUT_CONF_DIR defined (there 
are some dummy files living in the newly created directory conf at the top 
level, moving away from core/src/main/resources), then you can just run:

{code}
$MAHOUT_HOME/bin/mahout run svd
{code}

and it should read your properties in $MAHOUT_CONF_DIR/svd.props and run 
(locally).
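
For example, a hypothetical $MAHOUT_CONF_DIR/svd.props (keys invented for illustration; the real keys are whatever long-form options the job accepts) could look like:

{code}
input=/path/to/corpus/matrix
output=/path/to/svd/output
rank=50
{code}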

The other way it can work (and actually does, at least on my setup) is running 
on hadoop:

{code}
$HADOOP_HOME/bin/hadoop jar path/to/mahout.job 
org.apache.mahout.driver.MahoutDriver svd 
{code}

And again, $MAHOUT_CONF_DIR/svd.props is read locally before being launched off 
to the hadoop cluster.

I have not yet been able to get the shell script to automagically issue RunJar 
as the command, passing MahoutDriver and the remaining args after it, so that 
you would never need to run hadoop's shell script at all, although that would 
be great to have working.

Also not yet in this patch: actually defaulting MAHOUT_CONF_DIR to the correct 
place in both dev mode and release mode; and I haven't modified the pom to 
package up the new conf dir and put it in the distribution.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837607#action_12837607
 ] 

Drew Farris commented on MAHOUT-301:


It doesn't appear that the following command works as intended:

{code}
./bin/mahout org.apache.hadoop.util.RunJar 
/path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver 
TestClassifier
{code}

The following seems to be the appropriate way to achieve what we're trying to 
do here: 

{code}
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job 
org.apache.mahout.driver.MahoutDriver TestClassifier
{code}

Any thoughts on whether it makes sense to attempt to work the latter form into 
the mahout script? It won't pull the necessary config files for MahoutDriver in 
from a path outside of the job file unless HADOOP_CLASSPATH is set to include 
those directories, but I haven't had a chance to verify that.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837616#action_12837616
 ] 

Jake Mannix commented on MAHOUT-301:


Our comments crossed in the ether! :)

{quote}
Any thoughts on whether it makes sense to attempt to work the latter form into 
the mahout script? It won't pull the necessary config files for MahoutDriver in 
from a path outside of the job file unless HADOOP_CLASSPATH is set to include 
those directories, but I haven't had a chance to verify that.
{quote}

You're right - I did indeed set my HADOOP_CLASSPATH to include 
$MAHOUT_CONF_DIR, which allowed this to work; otherwise it would not.  This 
should be done by the script.

Ideally, yes - it's ugly, but if $MAHOUT_HOME/bin/mahout just sets 
$HADOOP_CLASSPATH to include $MAHOUT_CONF_DIR (or $MAHOUT_HOME/conf if that 
variable is not set) and then executes $HADOOP_HOME/bin/hadoop jar ..., it 
should work.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files

2010-02-23 Thread Jake Mannix (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix updated MAHOUT-301:
---

Attachment: MAHOUT-301.patch

Ok, now we're getting somewhere.  This one a) properly handles mahout run -h 
or mahout run --help, helpfully spitting out the list of classes with 
shortNames which MahoutDriver has been told about in driver.classes.props, and 
b) more importantly, can, both in a release environment and in a dev 
environment, do:

{code}
./bin/mahout run kmeans [options]
{code}

If $MAHOUT_CONF_DIR is set, and points to a place with the right files, then 
the default properties are loaded from there (overridden by [options] given 
above). 

If both $HADOOP_HOME and $HADOOP_CONF_DIR are set, then this actually sets 
$HADOOP_CLASSPATH to be prepended with $MAHOUT_CONF_DIR so that the following 
is actually run:

{code}
$HADOOP_HOME/bin/hadoop jar [path to examples.job] o.a.m.driver.MahoutDriver 
kmeans [options]
{code}

and it actually works: the default properties get loaded and overridden as 
necessary, and your job runs on the hadoop cluster.

If one of those variables is not specified (TODO: if $HADOOP_HOME is 
specified, but $HADOOP_CONF_DIR is not, guess a default of $HADOOP_HOME/conf, I 
suppose), then the assumption is to run locally.

Previous behavior still works, from what I can tell - you can still do:

{code}
$MAHOUT_HOME/bin/mahout kmeans --output kmeans/out --input input/vecs -k 13 
--clusters tmp/foobar
{code}

and we're backwards compatible with the old way.

Now the question is: do we want to be?  Or do we want to trim down the shell 
script to just always use MahoutDriver, and get rid of all of the 'elif [ 
$COMMAND =' stuff and just have $CLASS be MahoutDriver, passing it $COMMAND 
as the first argument?  

Then the command line would be exactly the same as before, except you could 
also load up your $MAHOUT_CONF_DIR/shortName.props files with whatever 
defaults you wanted to use.

 Improve command-line shell script by allowing default properties files
 --

 Key: MAHOUT-301
 URL: https://issues.apache.org/jira/browse/MAHOUT-301
 Project: Mahout
  Issue Type: New Feature
  Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
 Fix For: 0.4

 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, 
 MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch


 Snippet from javadoc gives the idea:
 {code}
 /**
  * General-purpose driver class for Mahout programs.  Utilizes
  * org.apache.hadoop.util.ProgramDriver to run main methods of other classes,
  * but first loads up default properties from a properties file.
  *
  * Usage: run on Hadoop like so:
  *
  * $HADOOP_HOME/bin/hadoop jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
  *   [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc]
  *
  * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed?
  *
  * (note: using the current shell script, this could be modified to be just
  * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options]
  * )
  *
  * Works like this: by default, the file core/src/main/resources/driver.classes.props is loaded, which
  * defines a mapping between short names like VectorDumper and fully qualified class names.  This file may
  * instead be overridden on the command line by having the first argument be some string of the form *classes.props.
  *
  * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the
  * driver.classes.props file).  After this, if the next argument ends in .props / .properties, it is taken to
  * be the file to use as the default properties file for this execution, and key-value pairs are built up from that:
  * if the file contains
  *
  * input=/path/to/my/input
  * output=/path/to/my/output
  *
  * Then the class which will be run will have its main called with
  *
  *   main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });
  *
  * After all the default properties are loaded from the file, any further command-line arguments are taken in,
  * and over-ride the defaults.
  */
 {code}
 Could be cleaned up, as it's kinda ugly with the whole file-named-in-.props 
 convention, but it gives the idea.  Really helps cut down on repetitive long 
 command lines, and lets defaults be put in props files instead of locked into 
 the code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: 0.3 release issues

2010-02-23 Thread Jeff Eastman

+1 from me too

Ted Dunning wrote:

+1 to code freeze, waiting for hadoop release and testing the RC

On Tue, Feb 23, 2010 at 8:38 AM, Isabel Drost isa...@apache.org wrote:

On Tue Grant Ingersoll gsing...@apache.org wrote:

On Feb 23, 2010, at 9:18 AM, Sean Owen wrote:

It does look imminent. As much as I don't like holding out longer,
and indefinitely, for this release, somehow I'd also really like to
link to the latest/greatest and official Hadoop release.

Let's try to be good about sticking to the code freeze -- good
chance to focus on polish -- and if 0.20.2 isn't out by end of
week, revisit this.

+1. We might as well upgrade to the RC, too, by adding it as a dependency.

+1 (to both proposals)

Isabel

RE: Algorithm implementations in Pig

2010-02-23 Thread Palleti, Pallavi
I too have mixed opinions w.r.t. Pig. Pig would be a good choice for quickly
prototyping and testing. However, the following are the pitfalls I have
observed in Pig.

It is not easy to debug in Pig. It also has performance issues: it is a
layer on top of Hadoop, so there is overhead in converting Pig into
map-reduce code. When the code is written directly in Hadoop, it is in the
developer's/user's hands to improve performance using various parameters,
say, number of mappers, different input formats, etc.; that is not the case
with Pig. There are also compatibility issues between Pig and Hadoop: if I
am using Pig version x on Hadoop version y, there may be incompatibilities,
and one has to spend time resolving them, as the errors are not easy to
figure out. 
I believe the main goal of Mahout is to provide scalable algorithms which
can be used to solve real-world problems. If Pig has gotten rid of the
above pitfalls, then it would be a good choice, as it would take much less
development effort. 

Thanks
Pallavi

-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Monday, February 22, 2010 11:32 PM
To: mahout-dev@lucene.apache.org
Subject: Re: Algorithm implementations in Pig

As an interesting test case, can you write a pig program that counts
words.

BUT, it takes an input file name AND an input field name.

On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com
wrote:


 That isn't an issue here.  It is the invocation of pig programs and 
 passing useful information to them that is the problem.


 On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel
gan...@yahoo-inc.com wrote:

 Scripting ability, while still limited, has better streaming support, so
 you can have relations streamed into a custom script executing in 
 either map or reduce phase depending upon where it is placed.




 --
 Ted Dunning, CTO
 DeepDyve




--
Ted Dunning, CTO
DeepDyve
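
One way Ted's test case might look, using Pig's parameter substitution to pass
the input path, the field name, and the output path (the relation names and the
two-column schema below are made up):

{code}
-- invoked as: pig -param INPUT=/docs.tsv -param FIELD=text -param OUTPUT=/counts wordcount.pig
docs   = LOAD '$INPUT' AS (docid:chararray, text:chararray);
words  = FOREACH docs GENERATE FLATTEN(TOKENIZE($FIELD)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '$OUTPUT';
{code}

Whether this counts as passing the field name cleanly is debatable, since the
substitution is purely textual, which may be exactly Ted's point.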


Re: Algorithm implementations in Pig

2010-02-23 Thread Ankur C. Goel
Pallavi,
  Thanks for your comments. Some clarifications w.r.t pig.

Pig does not generate any M/R code. What it generates are logical, physical, 
and map-reduce plans, which are nothing but DAGs. The map-reduce plan is then 
interpreted by Pig's own mappers/reducers. The plan generation itself is done 
on the client side and takes a few seconds, or minutes if you have a really 
big script.

About performance tuning in Hadoop: all the M/R parameters can be adjusted in 
Pig to have the same effect they'd have in Java M/R programs. Pig 0.7 is moving 
towards using Hadoop's input/output formats in its load/store functions, so your 
custom I/O formats can be reused with little additional effort.

Pig also provides very nice features, like MultiQuery optimization and skewed 
and merge joins, that are hard to implement in Java M/R every time you need them.
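For illustration, both joins are one-liners in Pig Latin (relation and field
names made up; assuming the skewed/merge join support in recent Pig releases):

{code}
J1 = JOIN pages BY url, clicks BY url USING 'skewed';  -- handles heavily skewed keys
J2 = JOIN a BY k, b BY k USING 'merge';                -- both inputs already sorted on k
{code}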

With the latest Pig release, 0.6, the performance gap between Java M/R and Pig 
has been narrowed to a good extent.

Simple statistical measures that you would use to understand or preprocess your 
data are very easy to compute with just a few lines of Pig code, and a lot of 
utility UDFs are available for that.

Besides all the good things, I agree that there are compatibility issues 
running Pig x on Hadoop y, but this also has to do with new features of Hadoop 
that Pig is able to exploit in its pipeline.

I also agree with the general opinion that, for Pig to be adopted in Mahout 
land, it should play well with Mahout's vector formats.

At the moment I don't have the free time to look into this, but I will surely 
get back to evaluating the feasibility of this integration in the coming few 
weeks. Till then, any interested folks can file a JIRA for this and work on it.

