[off-topic] Maven and SCP deploy.
There are many folks knowledgeable about Maven on this list, so I thought I'd ask -- I'm trying to write a POM with scp deployment, but Maven consistently fails for me with authentication errors. This is most likely caused by an outdated (and buggy) jsch dependency (0.1.38 instead of 0.1.42). Does anybody know how to override this dependency from within the POM for wagon-ssh? I've tried a dozen different configurations, but none work. Dawid
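One mechanism worth trying (the version number below is an assumption -- verify the transitive jsch version with `mvn dependency:tree`): pin wagon-ssh explicitly as a build extension in the POM, so Maven uses that release's jsch instead of the older one bundled with the Maven distribution:

```xml
<build>
  <extensions>
    <!-- Hypothetical version pin: pick a wagon-ssh release whose
         transitive jsch dependency is 0.1.42 or newer. -->
    <extension>
      <groupId>org.apache.maven.wagon</groupId>
      <artifactId>wagon-ssh</artifactId>
      <version>1.0-beta-6</version>
    </extension>
  </extensions>
</build>
```

Whether this actually overrides the wagon already baked into a given Maven release varies by Maven version, so treat it as a starting point rather than a definitive fix.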
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837156#action_12837156 ]

Sean Owen commented on MAHOUT-305:

You don't entirely drop ratings, it's that they don't figure into the similarity metric, right? But yes, ratings are not relevant to the point I was trying to make. Regardless, Harry Potter 3 is a better recommendation. I really think you have to take out the highest-rated items or this is a fairly flawed test, for this reason. Does anyone else have experience in or thoughts on defining precision and recall in this context? 3,4,5 is arbitrary, just pick the top n, or top n%, I'd imagine.

Combine both cooccurrence-based CF M/R jobs

Key: MAHOUT-305
URL: https://issues.apache.org/jira/browse/MAHOUT-305
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.2
Reporter: Sean Owen
Assignee: Ankur
Priority: Minor

We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
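For concreteness, precision and recall at n in this setting could be defined as below. This is an illustrative sketch only, not Mahout's evaluator; the "relevant" set would be each user's held-out items, chosen as the top-rated ones per the discussion above:

```java
import java.util.*;

// Illustrative only: "relevant" items are a user's held-out items,
// chosen as the user's top-rated items rather than a random cut.
class PrecisionRecallAtN {

  static int hitsInTopN(List<Long> recommended, Set<Long> relevant, int n) {
    int hits = 0;
    for (Long item : recommended.subList(0, Math.min(n, recommended.size()))) {
      if (relevant.contains(item)) {
        hits++;
      }
    }
    return hits;
  }

  /** Of the top n recommendations, what fraction were relevant? */
  static double precisionAtN(List<Long> recommended, Set<Long> relevant, int n) {
    return (double) hitsInTopN(recommended, relevant, n) / n;
  }

  /** Of the relevant items, what fraction showed up in the top n? */
  static double recallAtN(List<Long> recommended, Set<Long> relevant, int n) {
    return (double) hitsInTopN(recommended, relevant, n) / relevant.size();
  }
}
```

Averaging these over all users, and over several random re-draws of the held-out set, gives the protocol Ankur describes later in the thread.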
Re: Look! No more ISSUES
Maybe announce a release candidate quickly. Then we might get feedback on any bugs from users outside.

Robin

On Tue, Feb 23, 2010 at 3:10 PM, Sean Owen sro...@gmail.com wrote:

I'm happy to play release engineer. Per http://cwiki.apache.org/confluence/display/MAHOUT/How+to+release :

- I'm targeting a release for as soon as possible. Let's say this Friday.
- Let's call a code freeze right now. No changes except:
  - Javadoc fixes and improvements
  - New unit tests
  - Bug fixes due to new tests and new issues discovered now
- Focus on testing, examining the build, running examples

Meanwhile I'm going to start the motions for release so we can discover and discuss any wrinkles in the process that we didn't see last time.

Sean

On Mon, Feb 22, 2010 at 10:34 PM, Robin Anil robin.a...@gmail.com wrote: waiting for 301 to get committed. https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310751&styleName=Html&version=12314281 PMCs, it's in your hands now :D Robin
Re: data mining tool in hadoop
Check out Apache Mahout: http://lucene.apache.org/mahout. You are welcome to contribute.

--
Robin Anil
Blog: http://techdigger.wordpress.com
---
Mahout in Action - Mammoth Scale machine learning
Read Chapter 1 - It's Frrr: http://www.manning.com/owen
Try out Swipeball for iPhone: http://itunes.com/apps/swipeball

On Tue, Feb 23, 2010 at 4:00 PM, btp...@tce.edu wrote: hi all, I am new to this hadoop environment. I would like to develop a data mining tool for classification. Is there any data mining tool available in hadoop? thanks, parvathi - This email was sent using TCEMail Service. Thiagarajar College of Engineering, Madurai-625 015, India
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837192#action_12837192 ]

Ankur commented on MAHOUT-305:

Just picking random N% of the data for each user, calculating avg precision and recall across all users in the test data, and then repeating the test K times to take an average across all runs should be a reasonably fair assessment IMHO. Mahouters, your opinion here would be valuable.
Re: Look! No more ISSUES
On Tue Sean Owen sro...@gmail.com wrote: I'm happy to play release engineer. Great - Thanks, Sean. Isabel
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837193#action_12837193 ]

Sean Owen commented on MAHOUT-305:

I just don't think it can be random. It's like doing a PR test on search results and defining relevant documents as some randomly-chosen subset of all documents.
Re: Look! No more ISSUES
Any thoughts on the highlights we call out for 0.3 on the web site? Is this anything like right?

<li>New math, collections modules</li>
<li>LLR collocation implementation</li>
<li>FP-bonsai implementation</li>
<li>Hadoop-based Lanczos SVD solver</li>
<li>Shell scripts for easier running of algorithms, examples</li>
<li>... and much much more: code cleanup, many bug fixes and performance improvements</li>

On Tue, Feb 23, 2010 at 11:02 AM, Isabel Drost isa...@apache.org wrote: On Tue Sean Owen sro...@gmail.com wrote: I'm happy to play release engineer. Great - Thanks, Sean. Isabel
Re: Look! No more ISSUES
this was the format last time:

New Mahout 0.2 features include
- Major performance enhancements in Collaborative Filtering, Classification and Clustering
- New: Latent Dirichlet Allocation (LDA) implementation for topic modelling
- New: Frequent Itemset Mining for mining top-k patterns from a list of transactions
- New: Decision Forests implementation for Decision Tree classification (In Memory & Partial Data)
- New: HBase storage support for Naive Bayes model building and classification
- New: Generation of vectors from Text documents for use with Mahout Algorithms
- Performance improvements in various Vector implementations
- Tons of bug fixes and code cleanup

On Tue, Feb 23, 2010 at 4:45 PM, Sean Owen sro...@gmail.com wrote: Any thoughts on the highlights we call out for 0.3 on the web site? Is this anything like right? <li>New math, collections modules</li> <li>LLR collocation implementation</li> <li>FP-bonsai implementation</li> <li>Hadoop-based Lanczos SVD solver</li> <li>Shell scripts for easier running of algorithms, examples</li> <li>... and much much more: code cleanup, many bug fixes and performance improvements</li> On Tue, Feb 23, 2010 at 11:02 AM, Isabel Drost isa...@apache.org wrote: On Tue Sean Owen sro...@gmail.com wrote: I'm happy to play release engineer. Great - Thanks, Sean. Isabel
Re: Look! No more ISSUES
Format for what file? I'm editing the site's index.xml file here. On Tue, Feb 23, 2010 at 11:18 AM, Robin Anil robin.a...@gmail.com wrote: this was the format last time New Mahout 0.2 features include
Re: Look! No more ISSUES
There was a thread that went out for release announcements last time. Maybe we can start one for this one. Robin On Tue, Feb 23, 2010 at 4:52 PM, Sean Owen sro...@gmail.com wrote: Format for what file? I'm editing the site's index.xml file here. On Tue, Feb 23, 2010 at 11:18 AM, Robin Anil robin.a...@gmail.com wrote: this was the format last time New Mahout 0.2 features include
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837198#action_12837198 ]

Ankur commented on MAHOUT-305:

I am not proposing that we choose a random subset over all movies. Rather, choose random N% of movie ratings from EACH user and use it as test data to get precision/recall across this test set. Also repeat this procedure X times to get a fair assessment. They seem to do it the same way - http://www2007.org/papers/paper570.pdf
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837203#action_12837203 ]

Sean Owen commented on MAHOUT-305:

Yes I understand that, and it still doesn't change the issue here. The paper here deals with a data set with no ratings; picking any item as test data is as good as the next. This isn't the case when we have ratings, and we do.
0.3 release issues
OK first roadblock -- we can't depend on the Hadoop 0.20.2 snapshot. It might not be such a sin to depend on 0.20.1. I believe it will break the CF job in some instances, but this is not going to affect existing examples or unit tests (though it should :( ).
Re: 0.3 release issues
No issues with 0.20.1 for vectorizer, classifier, fpgrowth and clustering. What about lanczos and cf? On Tue, Feb 23, 2010 at 5:09 PM, Sean Owen sro...@gmail.com wrote: OK first roadblock -- we can't depend on the Hadoop 0.20.2 snapshot. It might not be such a sin to depend on 0.20.1. I believe it will break the CF job in some instances, but, this is not going to affect existing examples or unit tests (though it should :( ).
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837205#action_12837205 ]

Ankur commented on MAHOUT-305:

Well! Not factoring ratings into the similarity metric but having them influence the train/test data for evaluation doesn't sound fair to me. So I don't think both of us agree on the evaluation methodology.
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837209#action_12837209 ]

Sean Owen commented on MAHOUT-305:

They don't influence the similarity metric but they do influence the estimated ratings and therefore recommendations. Are we talking about the same algorithm? The last step is to multiply the co-occurrence matrix by the user -rating- vector.
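One reading of that multiply step, with dummy data (both the matrix and the ratings below are invented for illustration; this is a sketch, not the actual job's code, and in practice you would also skip items the user has already rated):

```java
// Dummy-data sketch of the step Sean describes: multiply the item
// co-occurrence matrix by the user's rating vector to score items.
class CooccurrenceScore {

  /**
   * score(i) = sum_j cooccurrence[i][j] * ratings[j].
   * With every rating equal to 1 this reduces to plain co-occurrence
   * counting, which is why the two jobs are "essentially identical".
   */
  static double[] scores(double[][] cooccurrence, double[] ratings) {
    double[] result = new double[cooccurrence.length];
    for (int i = 0; i < cooccurrence.length; i++) {
      for (int j = 0; j < ratings.length; j++) {
        result[i] += cooccurrence[i][j] * ratings[j];
      }
    }
    return result;
  }
}
```

Ankur's join-then-group formulation computes the same aggregation when each joined (rating, co-occurrence count) pair is multiplied before the group-by sums them.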
Re: 0.3 release issues
We can publish 0.20.2 on our site. It's pretty easy to do. On Feb 23, 2010, at 6:39 AM, Sean Owen wrote: OK first roadblock -- we can't depend on the Hadoop 0.20.2 snapshot. It might not be such a sin to depend on 0.20.1. I believe it will break the CF job in some instances, but, this is not going to affect existing examples or unit tests (though it should :( ).
Re: 0.3 release issues
Oh, I remember why: 0.20.1 doesn't exist in Maven. We had our own version of Hadoop 0.20.1, but when I run with that I get problems with it not finding Commons-Codec. I understand why that is and we could fix that too, but yes, Grant, it would be better, methinks, to publish our own copy of 0.20.2 for now and see how that flies. Er, how do we do that? Is it something you can describe, I can document and do? On Tue, Feb 23, 2010 at 11:41 AM, Robin Anil robin.a...@gmail.com wrote: No issues with 0.20.1 for vectorizer, classifier, fpgrowth and clustering. What about lanczos and cf? On Tue, Feb 23, 2010 at 5:09 PM, Sean Owen sro...@gmail.com wrote: OK first roadblock -- we can't depend on the Hadoop 0.20.2 snapshot. It might not be such a sin to depend on 0.20.1. I believe it will break the CF job in some instances, but, this is not going to affect existing examples or unit tests (though it should :( ).
Re: 0.3 release issues
On Tue Sean Owen sro...@gmail.com wrote: Er, how do we do that? Is it something you can describe, I can document and do? It already has been described - and documented in our wiki: http://cwiki.apache.org/MAHOUT/thirdpartydependencies.html Hope that helps, Isabel
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837230#action_12837230 ]

Ankur commented on MAHOUT-305:

*smile* There we go. Our last steps are essentially different. I don't do any multiplication; instead I just join (user, movie) on 'movie' with the co-occurrence set, followed by a group on 'user' to calculate recommendations. I guess while joining I should multiply ratings with co-occurrence counts for a better evaluation. Can you give a small illustrative example with dummy data to describe your last steps?
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837234#action_12837234 ]

Drew Farris commented on MAHOUT-301:

bq. including the job jar is much cleaner than adding all deps. Plus there is nothing more to configure to execute it on top of hadoop..

The job files work fine with 'hadoop jar', but putting the job files in the classpath will not automatically include the dependencies they contain (e.g. commons-cli2) on the classpath: the dependencies need to be added separately (see the ClassNotFoundException case described above).

bq. BTW. How is hadoop execution done using shell script?

If HADOOP_CONF_DIR is set, it should be picked up by the jobs, but I don't think that means jar/jobfile execution works properly. I suspect this needs modifications to make that possible.

Improve command-line shell script by allowing default properties files

Key: MAHOUT-301
URL: https://issues.apache.org/jira/browse/MAHOUT-301
Project: Mahout
Issue Type: New Feature
Components: Utils
Affects Versions: 0.3
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
Fix For: 0.4
Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch

Snippet from javadoc gives the idea:

{code}
/**
 * General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run
 * main methods of other classes, but first loads up default properties from a properties file.
 *
 * Usage: run on Hadoop like so:
 *
 * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
 *   [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc]
 *
 * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed?
 *
 * (note: using the current shell script, this could be modified to be just
 *   $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options]
 * )
 *
 * Works like this: by default, the file core/src/main/resources/driver.classes.props is loaded, which
 * defines a mapping between short names like VectorDumper and fully qualified class names. This file may
 * instead be overridden on the command line by having the first argument be some string of the form
 * *classes.props.
 *
 * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the
 * driver.classes.props file). After this, if the next argument ends in .props / .properties, it is taken to
 * be the file to use as the default properties file for this execution, and key-value pairs are built up from
 * that: if the file contains
 *
 *   input=/path/to/my/input
 *   output=/path/to/my/output
 *
 * then the class which will be run will have its main called with
 *
 *   main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });
 *
 * After all the default properties are loaded from the file, any further command-line arguments are taken in,
 * and over-ride the defaults.
 */
{code}

Could be cleaned up, as it's kinda ugly with the whole file named in *.props, but it gives the idea. Really helps cut down on repetitive long command lines, and lets defaults be put in props files instead of locked into the code.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
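The props-to-argv translation that javadoc describes could look roughly like this (a standalone sketch, not the actual MahoutDriver code; the class and method names here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Sketch: turn a defaults .props file plus command-line overrides into
// the String[] passed to the target class's main().
class PropsToArgs {

  /**
   * Turns {input=/path, output=/other} into
   * ["--input", "/path", "--output", "/other"],
   * with command-line overrides replacing file defaults.
   */
  static String[] toArgs(Properties defaults, Map<String, String> overrides) {
    Map<String, String> merged = new TreeMap<String, String>(); // sorted, deterministic
    for (String key : defaults.stringPropertyNames()) {
      merged.put(key, defaults.getProperty(key));
    }
    merged.putAll(overrides); // later command-line args win over file defaults

    List<String> args = new ArrayList<String>();
    for (Map.Entry<String, String> e : merged.entrySet()) {
      args.add("--" + e.getKey());
      args.add(e.getValue());
    }
    return args.toArray(new String[0]);
  }
}
```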
Re: 0.3 release issues
For hadoop, we probably should consider crafting a pom.xml file based on the SNAPSHOT's pom.xml so that hadoop dependencies will be included and we will not have to separately add them to Mahout. This also promotes re-use of the jars we do deploy by third parties (if the dependencies are correct). The hadoop 0.20.2-SNAPSHOT pom is at: https://repository.apache.org/service/local/repositories/snapshots/content/org/apache/hadoop/hadoop-core/0.20.2-SNAPSHOT/hadoop-core-0.20.2-20100112.125701-4.pom

We'd need to modify this to change the version and package name. It probably would not be a bad idea to have a date/timestamp or svn revision number from which the build was pulled somewhere in the version, e.g. 0.20.2-r87654, so others can easily track down the sources it was built from. The command-line to deploy using a pom would be:

mvn gpg:sign-and-deploy-file \
  -Durl=https://repository.apache.org/service/local/staging/deploy/maven2 \
  -DrepositoryId=apache.releases.https \
  -DgroupId=org.apache.mahout.hadoop \
  -DartifactId=hadoop-core \
  -Dversion=0.20.2 \
  -Dpackaging=jar \
  -Dfile=hadoop-core-0.20.2.jar \
  -DpomFile=hadoop-core-0.20.2.pom

If someone doesn't get to it in the next 12 hours, I can take a look at it this evening in EST, perhaps sooner. The other option would be to wait until a 0.20.2 release is available, which could be imminent. Last I saw on the list they were on rc4? Drew On Tue, Feb 23, 2010 at 7:23 AM, Isabel Drost isa...@apache.org wrote: On Tue Sean Owen sro...@gmail.com wrote: Er, how do we do that? Is it something you can describe, I can document and do? It already has been described - and documented in our wiki: http://cwiki.apache.org/MAHOUT/thirdpartydependencies.html Hope that helps, Isabel
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837243#action_12837243 ]

Drew Farris commented on MAHOUT-301:

bq. BTW. How is hadoop execution done using shell script?

i.e. it looks like something like the following would do the trick:

{code}
/bin/mahout -core org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier
{code}

We could probably provide a 'runjob' case that appends 'org.apache.hadoop.util.RunJar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver', but perhaps this could be used in every case that 'run' is called?
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837245#action_12837245 ]

Sean Owen commented on MAHOUT-305:

Yeah it's just a small generalization -- what I'm up to reduces to the same thing if all ratings are 1. You can make it run faster if there are no ratings involved, I'm sure. I'm going to send you a draft of chapter 6 of MiA which has a complete writeup on this.
Re: 0.3 release issues
On Feb 23, 2010, at 8:47 AM, Drew Farris wrote: The other option would be to wait until a 0.20.2 release is available, which could be imminent. Last I saw on the list they were on rc4? This doesn't seem horribly bad. We should download and try the RC and provide feedback.
Re: 0.3 release issues
It does look imminent. As much as I don't like holding out longer, and indefinitely, for this release, somehow I'd also really like to link to the latest/greatest and official Hadoop release. Let's try to be good about sticking to the code freeze -- good chance to focus on polish -- and if 0.20.2 isn't out by end of week, revisit this.
Re: 0.3 release issues
On Feb 23, 2010, at 9:18 AM, Sean Owen wrote: It does look imminent. As much as I don't like holding out longer, and indefinitely, for this release, somehow I'd also really like to link to the latest/greatest and official Hadoop release. Let's try to be good about sticking to the code freeze -- good chance to focus on polish -- and if 0.20.2 isn't out by end of week, revisit this. +1. We might as well upgrade to the RC, too, by adding it as a dependency.
SVD for dummies
Hey Jake, Was just going to ask for more insight into SVD when lo and behold, I checked my commits mail and saw http://cwiki.apache.org/confluence/display/MAHOUT/DimensionalReduction. Very nice! Thank you! -Grant
Re: SVD for dummies
-- Robin Anil Blog: http://techdigger.wordpress.com --- Mahout in Action - Mammoth Scale machine learning Read Chapter 1 - Its Frrr http://www.manning.com/owen Try out Swipeball for iPhone http://itunes.com/apps/swipeball On Tue, Feb 23, 2010 at 8:23 PM, Grant Ingersoll gsing...@apache.orgwrote: Hey Jake, Was just going to ask for more insight into SVD when lo and behold, I checked my commits mail and saw http://cwiki.apache.org/confluence/display/MAHOUT/DimensionalReduction. Very nice! Thank you! -Grant
Re: SVD for dummies
Sorry about that, my send button got it. And I forgot to turn undo on :( What I was about to say was: It's a great library Jake has given. I can't wait to test it by using the reduced vectors in clustering. Robin
Re: [off-topic] Maven and SCP deploy.
Hmm, what version are you on? I've done it successfully, but it usually requires some setup in your ~/.m2/settings.xml file to incorporate your public key, etc. I think Mahout has it configured. Check the How To Release page on the Wiki. On Feb 23, 2010, at 3:16 AM, Dawid Weiss wrote: There are many folks knowledgeable about maven on this list, so I thought I'd ask -- I'm trying to write a POM with scp deployment, but maven consistently fails for me with authentication errors -- this is most likely caused by an outdated (and buggy) jsch dependency (0.1.38 instead of 0.1.42). Anybody knows how to override this dependency from within the POM for wagon-ssh? I tried a dozen different configurations, but none work. Dawid
Re: 0.3 release issues
On Tue Grant Ingersoll gsing...@apache.org wrote: On Feb 23, 2010, at 9:18 AM, Sean Owen wrote: It does look imminent. As much as I don't like holding out longer, and indefinitely, for this release, somehow I'd also really like to link to the latest/greatest and official Hadoop release. Let's try to be good about sticking to the code freeze -- good chance to focus on polish -- and if 0.20.2 isn't out by end of week, revisit this. +1. We might as well upgrade to the RC, too, by adding it as a dependency. +1 (to both proposals) Isabel
[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer
[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837310#action_12837310 ]

Danny Leshem commented on MAHOUT-180:

While testing the new code, I encountered the following issue:

...
10/02/23 18:11:17 INFO lanczos.LanczosSolver: LanczosSolver finished.
10/02/23 18:11:17 ERROR decomposer.EigenVerificationJob: Unexpected --input while processing Options
Usage: [--eigenInput eigenInput --corpusInput corpusInput --help --output output --inMemory inMemory --maxError maxError --minEigenvalue minEigenvalue] Options
...

The problem seems to be in DistributedLanczosSolver.java [73]: EigenVerificationJob expects the parameters' names to be --eigenInput and --corpusInput, but you're mistakenly passing them as --input and --output.

Other than this minor issue, the code seems to be working fine and indeed produces the right amount of dense (eigen?) vectors.

port Hadoop-ified Lanczos SVD implementation from decomposer

Key: MAHOUT-180
URL: https://issues.apache.org/jira/browse/MAHOUT-180
Project: Mahout
Issue Type: New Feature
Components: Math
Affects Versions: 0.2
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Minor
Fix For: 0.3
Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch

I wrote up a hadoop version of the Lanczos algorithm for performing SVD on sparse matrices, available at http://decomposer.googlecode.com/, which is Apache-licensed, and I'm willing to donate it. I'll have to port over the implementation to use Mahout vectors, or else add in these vectors as well.
Current issues with the decomposer implementation include: if your matrix is really big, you need to re-normalize before decomposition: find the largest eigenvalue first, and divide all your rows by that value, then decompose, or else you'll blow over Double.MAX_VALUE once you've run too many iterations (the L^2 norm of intermediate vectors grows roughly as (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on the lower end is better than blowing over MAX_VALUE). When this is ported to Mahout, we should add in the capability to do this automatically (run a couple iterations to find the largest eigenvalue, save that, then iterate while scaling vectors by 1/max_eigenvalue). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
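The renormalization idea in miniature (a plain in-memory sketch, nothing like the actual Hadoop code; it assumes a symmetric matrix such as A^T A, as in the SVD setting): estimate the dominant eigenvalue with a few power iterations, then scale the matrix by its inverse so the iterates can no longer grow past Double.MAX_VALUE.

```java
import java.util.Arrays;

// Illustrative only: power iteration to find the largest eigenvalue,
// then scale the matrix by 1/lambda_max before decomposition.
class Renormalize {

  static double[] multiply(double[][] a, double[] v) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < v.length; j++) {
        out[i][j == j ? j : j] = out[i]; // no-op placeholder removed below
      }
    }
    return out;
  }

  /** Rough dominant-eigenvalue estimate via power iteration (symmetric a). */
  static double largestEigenvalue(double[][] a, int iterations) {
    int n = a.length;
    double[] v = new double[n];
    Arrays.fill(v, 1.0 / Math.sqrt(n));
    double lambda = 0.0;
    for (int it = 0; it < iterations; it++) {
      double[] av = new double[n];
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          av[i] += a[i][j] * v[j];
        }
      }
      // Rayleigh quotient v . Av estimates the dominant eigenvalue
      lambda = 0.0;
      double norm = 0.0;
      for (int i = 0; i < n; i++) {
        lambda += v[i] * av[i];
        norm += av[i] * av[i];
      }
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) {
        v[i] = av[i] / norm; // re-normalize the iterate
      }
    }
    return lambda;
  }

  /** Divide every entry by lambda_max so the dominant eigenvalue becomes 1. */
  static double[][] scaled(double[][] a, double lambdaMax) {
    double[][] out = new double[a.length][a.length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < a.length; j++) {
        out[i][j] = a[i][j] / lambdaMax;
      }
    }
    return out;
  }
}
```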
[jira] Issue Comment Edited: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer
[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837310#action_12837310 ] Danny Leshem edited comment on MAHOUT-180 at 2/23/10 4:59 PM: -- While testing the new code, I encountered the following issue: ... 10/02/23 18:11:17 INFO lanczos.LanczosSolver: LanczosSolver finished. 10/02/23 18:11:17 ERROR decomposer.EigenVerificationJob: Unexpected --input while processing Options Usage: [--eigenInput eigenInput --corpusInput corpusInput --help --output output --inMemory inMemory --maxError maxError --minEigenvalue minEigenvalue] Options ... The problem seems to be in DistributedLanczosSolver.java [73]: EigenVerificationJob expects the parameters' names to be eigenInput and corpusInput, but you're mistakenly passing them as input and output. Other than this minor issue, the code seems to be working fine and indeed produces the right amount of dense (eigen?) vectors. was (Author: dleshem): While testing the new code, I encountered the following issue: ... 10/02/23 18:11:17 INFO lanczos.LanczosSolver: LanczosSolver finished. 10/02/23 18:11:17 ERROR decomposer.EigenVerificationJob: Unexpected --input while processing Options Usage: [--eigenInput eigenInput --corpusInput corpusInput --help --output output --inMemory inMemory --maxError maxError --minEigenvalue minEigenvalue] Options ... The problem seems to be in DistributedLanczosSolver.java [73]: EigenVerificationJob expects the parameters' names to be --eigenInput and --corpusInput, but you're mistakenly passing them as --input and --output. Other than this minor issue, the code seems to be working fine and indeed produces the right amount of dense (eigen?) vectors. 
port Hadoop-ified Lanczos SVD implementation from decomposer Key: MAHOUT-180 URL: https://issues.apache.org/jira/browse/MAHOUT-180 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.2 Reporter: Jake Mannix Assignee: Jake Mannix Priority: Minor Fix For: 0.3 Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch I wrote up a hadoop version of the Lanczos algorithm for performing SVD on sparse matrices available at http://decomposer.googlecode.com/, which is Apache-licensed, and I'm willing to donate it. I'll have to port over the implementation to use Mahout vectors, or else add in these vectors as well. Current issues with the decomposer implementation include: if your matrix is really big, you need to re-normalize before decomposition: find the largest eigenvalue first, and divide all your rows by that value, then decompose, or else you'll blow over Double.MAX_VALUE once you've run too many iterations (the L^2 norm of intermediate vectors grows roughly as (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on the lower end is better than blowing over MAX_VALUE). When this is ported to Mahout, we should add in the capability to do this automatically (run a couple iterations to find the largest eigenvalue, save that, then iterate while scaling vectors by 1/max_eigenvalue). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer
[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837324#action_12837324 ] Jake Mannix commented on MAHOUT-180: Hi Danny, thanks for trying this out! You have indeed found some testing code which snuck in - I was trying to add the EigenVerificationJob to the final step of Lanczos, to save people the trouble of having to clean their eigenvectors at the end of the day, but didn't finish and yet it got checked in. The clue in the code is that I still have a line: {code} // TODO ack! {code} Which should be a hint that I should not have checked that file in just yet. :) I've removed it now - svn up and try again! If you want to see what your eigen-spectrum is like, after you've run the DistributedLanczosSolver, the EigenVerificationJob can be run next (it cleans out eigenvectors with too high error or too low eigenvalue): {code} $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-{version}.job org.apache.mahout.math.hadoop.decomposer.EigenVerificationJob \ --eigenInput path/for/svd-output --corpusInput path/to/corpus --output path/for/cleanOutput --maxError 0.1 --minEigenvalue 10.0 {code} Thanks for the bug report! port Hadoop-ified Lanczos SVD implementation from decomposer Key: MAHOUT-180 URL: https://issues.apache.org/jira/browse/MAHOUT-180 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.2 Reporter: Jake Mannix Assignee: Jake Mannix Priority: Minor Fix For: 0.3 Attachments: MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch, MAHOUT-180.patch I wrote up a hadoop version of the Lanczos algorithm for performing SVD on sparse matrices available at http://decomposer.googlecode.com/, which is Apache-licensed, and I'm willing to donate it. I'll have to port over the implementation to use Mahout vectors, or else add in these vectors as well. 
Current issues with the decomposer implementation include: if your matrix is really big, you need to re-normalize before decomposition: find the largest eigenvalue first, and divide all your rows by that value, then decompose, or else you'll blow over Double.MAX_VALUE once you've run too many iterations (the L^2 norm of intermediate vectors grows roughly as (largest-eigenvalue)^(num-eigenvalues-found-so-far), so losing precision on the lower end is better than blowing over MAX_VALUE). When this is ported to Mahout, we should add in the capability to do this automatically (run a couple iterations to find the largest eigenvalue, save that, then iterate while scaling vectors by 1/max_eigenvalue). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: 0.3 release issues
+1 to code freeze, waiting for hadoop release and testing the RC On Tue, Feb 23, 2010 at 8:38 AM, Isabel Drost isa...@apache.org wrote: On Tue Grant Ingersoll gsing...@apache.org wrote: On Feb 23, 2010, at 9:18 AM, Sean Owen wrote: It does look imminent. As much as I don't like holding out longer, and indefinitely, for this release, somehow I'd also really like to link to the latest/greatest and official Hadoop release. Let's try to be good about sticking to the code freeze -- good chance to focus on polish -- and if 0.20.2 isn't out by end of week, revisit this. +1. We might as well upgrade to the RC, too, by adding it as a dependency. +1 (to both proposals) Isabel -- Ted Dunning, CTO DeepDyve
Re: 0.3 release issues
So to be an annoying voice of dissent... I'm going to keep iterating on MAHOUT-301, targeted for 0.4, and I will keep it in patch form (not checked in) _for now_... but if it can get its wrinkles ironed out before Hadoop gets its act together, I really think it should get committed to 0.3. It's minimally invasive (one java file, and some config files, all additions (not changes), and then changes to the mahout shell script), and has the potential to really make repeatedly running our tools far easier than it currently is. -jake ps. if we are really doing a code freeze, can we make a dev branch (or more appropriately, make a 0.3 release branch, and allow continued development on trunk)? We don't really want to hold up on producing new stuff, do we? More more MORE! ;) On Tue, Feb 23, 2010 at 9:25 AM, Ted Dunning ted.dunn...@gmail.com wrote: +1 to code freeze, waiting for hadoop release and testing the RC On Tue, Feb 23, 2010 at 8:38 AM, Isabel Drost isa...@apache.org wrote: On Tue Grant Ingersoll gsing...@apache.org wrote: On Feb 23, 2010, at 9:18 AM, Sean Owen wrote: It does look imminent. As much as I don't like holding out longer, and indefinitely, for this release, somehow I'd also really like to link to the latest/greatest and official Hadoop release. Let's try to be good about sticking to the code freeze -- good chance to focus on polish -- and if 0.20.2 isn't out by end of week, revisit this. +1. We might as well upgrade to the RC, too, by adding it as a dependency. +1 (to both proposals) Isabel -- Ted Dunning, CTO DeepDyve
Re: 0.3 release issues
I say use your judgment. If you feel confident enough for it to be enshrined for about 3 months in an official release, check it in. Soon we will indeed want a proper release branch. This time it shouldn't be more than a few days, and, probably good discipline to force everyone to do nothing but documentation and test improvements for that time. On Tue, Feb 23, 2010 at 5:40 PM, Jake Mannix jake.man...@gmail.com wrote: So to be an annoying voice of dissent... I'm going to keep iterating on MAHOUT-301, targeted for 0.4, and I will keep it in patch form (not checked in) _for now_... but if it can get its wrinkles ironed out before Hadoop gets its act together, I really think it should get committed to 0.3. It's minimally invasive (one java file, and some config files, all additions (not changes), and then changes to the mahout shell script), and has the potential to really make repeatedly running our tools far easier than it currently is. -jake ps. if we are really doing a code freeze, can we make a dev branch (or more appropriately, make a 0.3 release branch, and allow continued development on trunk)? We don't really want to hold up on producing new stuff, do we? More more MORE! ;)
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837345#action_12837345 ] Jake Mannix commented on MAHOUT-301: Hey Drew, thanks for looking at this. Problems you saw are probably what are known as bugs. :) {quote} Did some testing, here's a patch to clean some of these things up + a couple questions: Could we load the default driver.classes.props from the classpath? If it was loaded that way the default would work regardless of where the mahout script is run from (it currently only works if ./bin/mahout is run, not ./mahout for example) and regardless of whether we're running from a binary release or the dev environment. (included in patch) {quote} YES! We should indeed load from classpath. My most recent version of this patch (which isn't posted, because it conflicts with yours, I'm trying to resolve that now) changes it so that you just supply a single directory in which driver.classes.props and the shortNames.props files are located. {quote} Something else I noticed is that the 'mahout' script doesn't add the classes in $MAHOUT_HOME/lib/*.jar to the classpath. This breakes the binary release in that it can't run anything, e.g: ./mahout vectordump Exception in thread main java.lang.NoClassDefFoundError: org/apache/commons/cli2/OptionException Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli2.OptionException (fixed in patch) {code} This wasn't a problem with my patch, right? That was an issue of the mahout script in trunk itself? {code} Using -core in the context of a dev build should work properly, but leaving out -core will cause the script to error unless run in the context of a release - this is the way it should work, right? {code} What is the -core option for? I've never used it, how does it work? {code} Also added a help message for the 'run' argument. {code} Where did you add that? 
{code} Does executing './mahout run --help' hang for anyone else or is it something specific to my environment? (didn't track this one down) {code} The --help option I didn't have in there, you added it, do you know where it's hanging? Improve command-line shell script by allowing default properties files -- Key: MAHOUT-301 URL: https://issues.apache.org/jira/browse/MAHOUT-301 Project: Mahout Issue Type: New Feature Components: Utils Affects Versions: 0.3 Reporter: Jake Mannix Assignee: Jake Mannix Priority: Minor Fix For: 0.4 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch Snippet from javadoc gives the idea: {code} /** * General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run * main methods of other classes, but first loads up default properties from a properties file. * * Usage: run on Hadoop like so: * * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \ * [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc] * * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed? * * (note: using the current shell scipt, this could be modified to be just * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options] * ) * * Works like this: by default, the file core/src/main/resources/driver.classes.prop is loaded, which * defines a mapping between short names like VectorDumper and fully qualified class names. This file may * instead be overridden on the command line by having the first argument be some string of the form *classes.props. * * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the * driver.classes.props file). 
After this, if the next argument ends in .props / .properties, it is taken to * be the file to use as the default properties file for this execution, and key-value pairs are built up from that: * if the file contains * * input=/path/to/my/input * output=/path/to/my/output * * Then the class which will be run will have its main called with * * main(new String[] { --input, /path/to/my/input, --output, /path/to/my/output }); * * After all the default properties are loaded from the file, any further command-line arguments are taken in, * and over-ride the defaults. */ {code} Could be cleaned up, as it's kinda ugly with the whole file named in .props, but gives the idea. Really helps cut down on repetitive long command lines, lets defaults be put in props files instead of locked into the code
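The props-to-args expansion that the javadoc above describes can be sketched as follows. This is an illustration, not the actual MahoutDriver code; the class and method names here are made up. Keys from the .props file become long-form options, and later command-line arguments override the file's defaults.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

/**
 * Sketch of turning a default-properties file into a main(String[]) call,
 * with command-line overrides winning over file defaults.
 */
public class PropsToArgs {

  static String[] buildArgs(String propsText, String[] overrides) {
    Properties defaults = new Properties();
    try {
      defaults.load(new StringReader(propsText));
    } catch (IOException e) {
      throw new IllegalStateException(e); // cannot happen for in-memory text
    }
    // Merge: file defaults first, then command-line pairs override them.
    Map<String, String> merged = new LinkedHashMap<>();
    for (String key : defaults.stringPropertyNames()) {
      merged.put("--" + key, defaults.getProperty(key));
    }
    for (int i = 0; i + 1 < overrides.length; i += 2) {
      merged.put(overrides[i], overrides[i + 1]);
    }
    // Flatten to the argv that the target class's main() will receive.
    List<String> args = new ArrayList<>();
    for (Map.Entry<String, String> e : merged.entrySet()) {
      args.add(e.getKey());
      args.add(e.getValue());
    }
    return args.toArray(new String[0]);
  }
}
```

So a file containing `input=/path/to/my/input` plus a command line of `--input /other/input` yields a single `--input /other/input` pair, matching the "further command-line arguments over-ride the defaults" behavior described above.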
Re: 0.3 release issues
What about a new follow-on JIRA so 301 can stay in the official release notes? On Tue, Feb 23, 2010 at 9:40 AM, Jake Mannix jake.man...@gmail.com wrote: So to be an annoying voice of dissent... I'm going to keep iterating on MAHOUT-301, targeted for 0.4, and I will keep it in patch form (not checked in) _for now_... but if it can get its wrinkles ironed out before Hadoop gets its act together, I really think it should get committed to 0.3. -- Ted Dunning, CTO DeepDyve
Re: 0.3 release issues
What does this mean? You mean make a 301-continuation ticket for 0.4, and reschedule the original 301 for 0.3? I could do that *if* 301 is in good shape by the time Hadoop is ready, but I don't want to reschedule it until it is. -jake On Tue, Feb 23, 2010 at 9:57 AM, Ted Dunning ted.dunn...@gmail.com wrote: WHat about a new follow-on JIRA so 301 can stay in the official release notes? On Tue, Feb 23, 2010 at 9:40 AM, Jake Mannix jake.man...@gmail.com wrote: So to be an annoying voice of dissent... I'm going to keep iterating on MAHOUT-301, targetted for 0.4, and I will keep it in patch form (not checked in) _for now_... but if it can get its wrinkles ironed out before Hadoop gets its act together, I really think it should get committed to 0.3. -- Ted Dunning, CTO DeepDyve
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837351#action_12837351 ] Jake Mannix commented on MAHOUT-301: Ok, Drew, got your patch in diff mode against mine finally. So you already added the ability to load via classpath, right? If we merge that way of thinking with what I'm currently working on (having a configurable MAHOUT_CONF_DIR which is used for all these props files), we could just have the mahout shell script just add MAHOUT_CONF_DIR to the classpath (the way you already have it adding the hardwired core/src/main/resources directory) and then it would work that way. New patch merging yours with mine forthcoming. Improve command-line shell script by allowing default properties files -- Key: MAHOUT-301 URL: https://issues.apache.org/jira/browse/MAHOUT-301 Project: Mahout Issue Type: New Feature Components: Utils Affects Versions: 0.3 Reporter: Jake Mannix Assignee: Jake Mannix Priority: Minor Fix For: 0.4 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch Snippet from javadoc gives the idea: {code} /** * General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run * main methods of other classes, but first loads up default properties from a properties file. * * Usage: run on Hadoop like so: * * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \ * [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc] * * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed? 
* * (note: using the current shell scipt, this could be modified to be just * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options] * ) * * Works like this: by default, the file core/src/main/resources/driver.classes.prop is loaded, which * defines a mapping between short names like VectorDumper and fully qualified class names. This file may * instead be overridden on the command line by having the first argument be some string of the form *classes.props. * * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the * driver.classes.props file). After this, if the next argument ends in .props / .properties, it is taken to * be the file to use as the default properties file for this execution, and key-value pairs are built up from that: * if the file contains * * input=/path/to/my/input * output=/path/to/my/output * * Then the class which will be run will have it's main called with * * main(new String[] { --input, /path/to/my/input, --output, /path/to/my/output }); * * After all the default properties are loaded from the file, any further command-line arguments are taken in, * and over-ride the defaults. */ {code} Could be cleaned up, as it's kinda ugly with the whole file named in .props, but gives the idea. Really helps cut down on repetitive long command lines, lets defaults be put props files instead of locked into the code also. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: 0.3 release issues
I assume he means mark 301 as fixed (when appropriate) and then open a new ticket for follow on work marked for 0.4. -Grant On Feb 23, 2010, at 1:03 PM, Jake Mannix wrote: What does this mean? You mean make a 301-continuation ticket for 0.4, and reschedule the original 301 for 0.3? I could do that *if* 301 is in good shape by the time Hadoop is ready, but I don't want to reschedule it until it is. -jake On Tue, Feb 23, 2010 at 9:57 AM, Ted Dunning ted.dunn...@gmail.com wrote: WHat about a new follow-on JIRA so 301 can stay in the official release notes? On Tue, Feb 23, 2010 at 9:40 AM, Jake Mannix jake.man...@gmail.com wrote: So to be an annoying voice of dissent... I'm going to keep iterating on MAHOUT-301, targetted for 0.4, and I will keep it in patch form (not checked in) _for now_... but if it can get its wrinkles ironed out before Hadoop gets its act together, I really think it should get committed to 0.3. -- Ted Dunning, CTO DeepDyve
Re: 0.3 release issues
Exactly. On Tue, Feb 23, 2010 at 10:16 AM, Grant Ingersoll gsing...@apache.orgwrote: I assume he means mark 301 as fixed (when appropriate) and then open a new ticket for follow on work marked for 0.4. -Grant On Feb 23, 2010, at 1:03 PM, Jake Mannix wrote: What does this mean? You mean make a 301-continuation ticket for 0.4, and reschedule the original 301 for 0.3? I could do that *if* 301 is in good shape by the time Hadoop is ready, but I don't want to reschedule it until it is. -jake On Tue, Feb 23, 2010 at 9:57 AM, Ted Dunning ted.dunn...@gmail.com wrote: WHat about a new follow-on JIRA so 301 can stay in the official release notes? On Tue, Feb 23, 2010 at 9:40 AM, Jake Mannix jake.man...@gmail.com wrote: So to be an annoying voice of dissent... I'm going to keep iterating on MAHOUT-301, targetted for 0.4, and I will keep it in patch form (not checked in) _for now_... but if it can get its wrinkles ironed out before Hadoop gets its act together, I really think it should get committed to 0.3. -- Ted Dunning, CTO DeepDyve -- Ted Dunning, CTO DeepDyve
Re: 0.3 release issues
FWIW, I think we're pretty close on MAHOUT-301. I also think there may be a couple issues with what's in there now for bin/mahout (pre-301). So I'd say +1 for a freeze except for MAHOUT-301 (which I guess isn't a freeze then :-D) +1 for switching to the hadoop rc for testing. On Tue, Feb 23, 2010 at 12:40 PM, Jake Mannix jake.man...@gmail.com wrote: So to be an annoying voice of dissent... I'm going to keep iterating on MAHOUT-301, targeted for 0.4, and I will keep it in patch form (not checked in) _for now_... but if it can get its wrinkles ironed out before Hadoop gets its act together, I really think it should get committed to 0.3. It's minimally invasive (one java file, and some config files, all additions (not changes), and then changes to the mahout shell script), and has the potential to really make repeatedly running our tools far easier than it currently is. -jake ps. if we are really doing a code freeze, can we make a dev branch (or more appropriately, make a 0.3 release branch, and allow continued development on trunk)? We don't really want to hold up on producing new stuff, do we? More more MORE! ;) On Tue, Feb 23, 2010 at 9:25 AM, Ted Dunning ted.dunn...@gmail.com wrote: +1 to code freeze, waiting for hadoop release and testing the RC On Tue, Feb 23, 2010 at 8:38 AM, Isabel Drost isa...@apache.org wrote: On Tue Grant Ingersoll gsing...@apache.org wrote: On Feb 23, 2010, at 9:18 AM, Sean Owen wrote: It does look imminent. As much as I don't like holding out longer, and indefinitely, for this release, somehow I'd also really like to link to the latest/greatest and official Hadoop release. Let's try to be good about sticking to the code freeze -- good chance to focus on polish -- and if 0.20.2 isn't out by end of week, revisit this. +1. We might as well upgrade to the RC, too, by adding it as a dependency. +1 (to both proposals) Isabel -- Ted Dunning, CTO DeepDyve
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837378#action_12837378 ] Ted Dunning commented on MAHOUT-305: My own experience is that all that counts in recommendations is the probability of click (interest) on a set of recommendations. As such, the best analog is probably precision at 10 or 20. I don't think that recall at 10 or 20 makes any sense at all (with a depth limited situation like this, you have given up on recall and are only looking at precision). Ankur's suggestion about keeping the most recent 4's and 5's as test data seems right to me. My only beefs are that you don't need recall@10, and the question of what to do with the unrated items. Presumably a new style algorithm could surface items that the user hadn't thought of, but really likes. In practice, I think that counting unrated items in the results as misses isn't a big deal in the Netflix data. In the real world where test data is more scarce, I would count unrated items as misses in off-line evaluation, but try to run as many alternatives as possible against live users. Combine both cooccurrence-based CF M/R jobs --- Key: MAHOUT-305 URL: https://issues.apache.org/jira/browse/MAHOUT-305 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.2 Reporter: Sean Owen Assignee: Ankur Priority: Minor We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837376#action_12837376 ] Drew Farris commented on MAHOUT-301: {quote} This wasn't a problem with my patch, right? That was an issue of the mahout script in trunk itself? {quote} Yes it was a problem with the script in trunk. I believe this was due to the fact that the job files were on the classpath instead of all of the dependency jars. Adding the job files to the classpath does not add the dependency jars they contain to the classpath as well. So, no, you didn't add this, but it should be fixed (and is in the patch) {quote} What is the -core option for? I've never used it, how does it work? {quote} when you're running bin/mahout in the context of a build the -core option is used to tell it to use the build classpath instead of the classpath used for a binary release. This just follows the pattern established (by Doug?) in the hadoop and nutch launch scripts. {quote} Also added a help message for the 'run' argument. {quote} near line 72 in bin/mahout: (this is different from the --help question I had) {code} echo seq2sparsegenerate sparse vectors from a sequence file echo vectordumpdump vectors from a sequence file echo run run mahout tasks using the MahoutDriver, see: http://cwiki.apache.org/MAHOUT/mahoutdriver.html; {code} {quote} So you already added the ability to load via classpath, right? If we merge that way of thinking with what I'm currently working on (having a configurable MAHOUT_CONF_DIR which is used for all these props files), we could just have the mahout shell script just add MAHOUT_CONF_DIR to the classpath (the way you already have it adding the hardwired core/src/main/resources directory) and then it would work that way. {quote} Yep, that should do it, as long as MAHOUT_CONF_DIR appears before src/main/resources, we should be good to go. 
It should be added outside of the section of the script that determines if -core has been specified on the command-line. Improve command-line shell script by allowing default properties files -- Key: MAHOUT-301 URL: https://issues.apache.org/jira/browse/MAHOUT-301 Project: Mahout Issue Type: New Feature Components: Utils Affects Versions: 0.3 Reporter: Jake Mannix Assignee: Jake Mannix Priority: Minor Fix For: 0.4 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch Snippet from javadoc gives the idea: {code} /** * General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run * main methods of other classes, but first loads up default properties from a properties file. * * Usage: run on Hadoop like so: * * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \ * [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc] * * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed? * * (note: using the current shell scipt, this could be modified to be just * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options] * ) * * Works like this: by default, the file core/src/main/resources/driver.classes.prop is loaded, which * defines a mapping between short names like VectorDumper and fully qualified class names. This file may * instead be overridden on the command line by having the first argument be some string of the form *classes.props. * * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the * driver.classes.props file). 
After this, if the next argument ends in .props / .properties, it is taken to * be the file to use as the default properties file for this execution, and key-value pairs are built up from that: * if the file contains * * input=/path/to/my/input * output=/path/to/my/output * * Then the class which will be run will have it's main called with * * main(new String[] { --input, /path/to/my/input, --output, /path/to/my/output }); * * After all the default properties are loaded from the file, any further command-line arguments are taken in, * and over-ride the defaults. */ {code} Could be cleaned up, as it's kinda ugly with the whole file named in .props, but gives the idea. Really helps cut down on repetitive long command lines, lets defaults be put props files instead of locked into the code also. -- This message is automatically generated by JIRA. - You can
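The MAHOUT_CONF_DIR idea discussed in this thread can be sketched as a few lines of shell. This is not the actual bin/mahout script and the variable names and default paths are assumptions; it only illustrates putting a user-settable conf directory ahead of the bundled resources on the classpath, outside of any -core handling.

```shell
# Sketch (assumed names/paths, not the real bin/mahout): a user-settable
# MAHOUT_CONF_DIR takes precedence over the props files bundled with the
# release because it appears first on the classpath.
MAHOUT_HOME=${MAHOUT_HOME:-/opt/mahout}
MAHOUT_CONF_DIR=${MAHOUT_CONF_DIR:-$MAHOUT_HOME/conf}

# Conf dir first, so a driver.classes.props placed there wins over the
# copy packaged in core/src/main/resources.
CLASSPATH="$MAHOUT_CONF_DIR:$CLASSPATH"
echo "classpath: $CLASSPATH"
```

Because the classpath is searched in order, whichever driver.classes.props the conf directory provides shadows the packaged default, which is the "as long as MAHOUT_CONF_DIR appears before src/main/resources" point made above.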
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837377#action_12837377 ] Ted Dunning commented on MAHOUT-305: My own experience is that all that counts in recommendations is the probability of click (interest) on a set of recommendations. As such, the best analog is probably precision at 10 or 20. I don't think that recall at 10 or 20 makes any sense at all (with a depth limited situation like this, you have given up on recall and are only looking at precision). Ankur's suggestion about keeping the most recent 4's and 5's as test data seems right to me. My only beefs are that you don't need recall@10, and the question of what to do with the unrated items. Presumably a new style algorithm could surface items that the user hadn't thought of, but really likes. In practice, I think that counting unrated items in the results as misses isn't a big deal in the Netflix data. In the real world where test data is more scarce, I would count unrated items as misses in off-line evaluation, but try to run as many alternatives as possible against live users. Combine both cooccurrence-based CF M/R jobs --- Key: MAHOUT-305 URL: https://issues.apache.org/jira/browse/MAHOUT-305 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.2 Reporter: Sean Owen Assignee: Ankur Priority: Minor We have two different but essentially identical MapReduce jobs to make recommendations based on item co-occurrence: org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be merged. Not sure exactly how to approach that but noting this in JIRA, per Ankur. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
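Precision-at-N in the sense discussed in this thread can be stated concretely: of the top N recommended items, what fraction appear in the held-out "relevant" set (e.g. the user's most recent 4- and 5-star ratings), with unrated items simply counting as misses? A minimal illustrative sketch (names assumed, not Mahout evaluator API):

```java
import java.util.List;
import java.util.Set;

/**
 * Sketch of precision@N: hits among the top N ranked recommendations,
 * divided by N (capped at the number of recommendations actually made).
 * Items absent from the relevant set -- including unrated ones -- count
 * as misses.
 */
public class PrecisionAtN {

  static double precisionAtN(List<Long> rankedRecommendations, Set<Long> relevant, int n) {
    int limit = Math.min(n, rankedRecommendations.size());
    if (limit == 0) {
      return 0.0;
    }
    int hits = 0;
    for (int i = 0; i < limit; i++) {
      if (relevant.contains(rankedRecommendations.get(i))) {
        hits++;
      }
    }
    return (double) hits / limit;
  }
}
```

This makes Ted's point mechanical: with a fixed depth N, recall (hits divided by the size of the whole relevant set) is bounded by N and mostly reflects the cutoff, while precision@N directly approximates the probability that a shown recommendation is one the user would click.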
RE: SVD for dummies
Here's another source of understanding about SVD: MIT Professor Gilbert Strang's lecture on the subject. Strang is a fine educator and gentleman and wonderfully clear with his explanation of the underlying geometry of SVD. http://videolectures.net/mit1806s05_strang_lec29/ You can also find Strang's lecture and others by searching for singular value decomposition on YouTube. Mike -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Tuesday, February 23, 2010 6:54 AM To: Mahout Dev List Subject: SVD for dummies Hey Jake, Was just going to ask for more insight into SVD when lo and behold, I checked my commits mail and saw http://cwiki.apache.org/confluence/display/MAHOUT/DimensionalReduction. Very nice! Thank you! -Grant
Re: [off-topic] Maven and SCP deploy.
Hmm, what version are you on? I've done it successfully, but it usually requires some setup in your ~/.m2/settings.xml file to incorporate your public key, etc. I think Mahout has it configured. Check the How To Release page on the Wiki.

Apache Maven 2.2.1 (r801777; 2009-08-06 21:16:01+0200). I'm using a POM configuration based on an identical parent as Mahout, but no success. Interestingly, the default scp (built-in) works on one server, but fails on another. I've seen tons of e-mails about this on the internet, and apparently the problem is due to an outdated jsch... It may be this particular server's SSH configuration, although command-line scp does work (and so does ssh). I've tried all possible configurations -- scp, scpexe (external scp), private-key and password authentication. scpexe was the best of them and worked from time to time... but not always. Can you check what your set of plugins is, Grant? (If you issue mvn -X ... | grep jsch it will print the jsch plugins actually used.) I did manage to set up an ftp deployment, so this is not crucial for me anymore, but I'd rather have scp than ftp. Thanks, Dawid

On Feb 23, 2010, at 3:16 AM, Dawid Weiss wrote: There are many folks knowledgeable about maven on this list, so I thought I'd ask -- I'm trying to write a POM with scp deployment, but maven consistently fails for me with authentication errors -- this is most likely caused by an outdated (and buggy) jsch dependency (0.1.38 instead of 0.1.42). Anybody know how to override this dependency from within the POM for wagon-ssh? I tried a dozen different configurations, but none worked. Dawid
Re: [off-topic] Maven and SCP deploy.
Can you override that dependency? On Tue, Feb 23, 2010 at 12:07 PM, Dawid Weiss dawid.we...@gmail.com wrote: but apparently the problem is due to an outdated jsch -- Ted Dunning, CTO DeepDyve
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837428#action_12837428 ] Jake Mannix commented on MAHOUT-301: {quote} Something else I noticed is that the 'mahout' script doesn't add the classes in $MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in that it can't run anything, e.g.: {quote} {quote} Also wondering what the purpose of adding the job jars to the classpath is? (removed in patch) {quote} When I run locally now, not using -core, I get this failure: {code} ./bin/mahout vectordump -s wiki-sparse-vectors-out/vectors/part-0 Exception in thread main java.lang.NoClassDefFoundError: org/apache/mahout/utils/vectors/VectorDumper {code} This appears to be because your patch has CLASSPATH set to add on things like $MAHOUT_HOME/mahout-*.jar, which doesn't exist after I've done mvn install. Is there another maven target I need to use to generate the release jars in $MAHOUT_HOME? Improve command-line shell script by allowing default properties files -- Key: MAHOUT-301 URL: https://issues.apache.org/jira/browse/MAHOUT-301 Project: Mahout Issue Type: New Feature Components: Utils Affects Versions: 0.3 Reporter: Jake Mannix Assignee: Jake Mannix Priority: Minor Fix For: 0.4 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch Snippet from javadoc gives the idea: {code} /** * General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run * main methods of other classes, but first loads up default properties from a properties file. * * Usage: run on Hadoop like so: * * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \ * [default.props file for this class] [override options, all specified in long form: --input, --jarFile, etc.] * * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed? 
* * (note: using the current shell script, this could be modified to be just * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [override options] * ) * * Works like this: by default, the file core/src/main/resources/driver.classes.prop is loaded, which * defines a mapping between short names like VectorDumper and fully qualified class names. This file may * instead be overridden on the command line by having the first argument be some string of the form *classes.props. * * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the * driver.classes.props file). After this, if the next argument ends in .props / .properties, it is taken to * be the file to use as the default properties file for this execution, and key-value pairs are built up from that: * if the file contains * * input=/path/to/my/input * output=/path/to/my/output * * Then the class which will be run will have its main called with * * main(new String[] { --input, /path/to/my/input, --output, /path/to/my/output }); * * After all the default properties are loaded from the file, any further command-line arguments are taken in, * and override the defaults. */ {code} Could be cleaned up, as it's kinda ugly with the whole file named in .props, but gives the idea. Really helps cut down on repetitive long command lines, and lets defaults be put in props files instead of locked into the code. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
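As an aside, the key=value to --key value expansion that the javadoc above describes can be sketched like this. This is a simplified, hypothetical illustration of the idea, not the actual MahoutDriver code; the class and method names here are made up.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

public class PropsToArgs {
    /** Turn a .props stream into long-form command-line arguments: input=/x becomes --input /x. */
    static String[] toArgs(Reader propsSource) {
        Properties props = new Properties();
        try {
            props.load(propsSource);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> args = new ArrayList<>();
        // Sort the keys for deterministic output; java.util.Properties itself
        // does not preserve file order.
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            args.add("--" + name);
            args.add(props.getProperty(name));
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Same example file contents as in the javadoc above.
        String defaults = "input=/path/to/my/input\noutput=/path/to/my/output\n";
        for (String a : toArgs(new StringReader(defaults))) {
            System.out.println(a);
        }
    }
}
```

Command-line overrides would then simply be appended after these defaults, so the last occurrence of an option wins.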
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837434#action_12837434 ] Drew Farris commented on MAHOUT-301: Jake, the basic idea is that you would always use -core when executing from within a build, but you would not use -core when executing in the context of a binary release. The binary release, built using mvn -Prelease, lands in target/mahout-0.3-SNAPSHOT.tar.gz; untar that and try running bin/mahout from the directory that's created, and that should work fine without -core.
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837440#action_12837440 ] Jake Mannix commented on MAHOUT-301: {quote} Jake, the basic idea is that you would always use -core when executing from within a build, but you would not use -core when executing in the context of a binary release. {quote} Hmm... ok. I'm a little reticent about running -core when testing, because I'm not really testing what the release run will be like - I like the idea of having a single set of dependencies (jars, not classes directories) which are used locally, and the .job when hitting a remote hadoop cluster. Maybe I'm just not familiar with the -core option and its use. So far, I've always run by the process of * make code/config changes * run mvn clean install (sometimes with -DskipTests if I'm doing rapid iterations) * run mahout command args OR * hadoop jar examples/target/mahout-examples-{version}.job classname args The last step, as you've noted, is because I'm not sure that the script actually lets HADOOP_CONF_DIR get passed through the mahout shell script to actually run on the hadoop cluster, but maybe that's just a config issue in my case? It also means that the default-properties idea still doesn't work on hadoop, unless the default properties files are pushed to the classpath. Maybe a kludgey way to do it would be for the script to grab the properties files from MAHOUT_CONF_DIR, unzip the release job jar, push them into it, re-jar it back up, and then give it to hadoop, so that those files will be available on the classpath of the running job on the remote cluster? What is the right way to run a job with some additional (runtime) files added to the job's classpath? Is there some cmdline arg to hadoop that I'm forgetting? 
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837448#action_12837448 ] Drew Farris commented on MAHOUT-301: {quote} Hmm... ok. I'm a little reticent about running -core when testing, because I'm not really testing what the release run will be like - I like the idea of having a single set of dependencies (jars, not classes directories) which are used locally, and the .job when hitting a remote hadoop cluster. Maybe I'm just not familiar with the -core option and its use. {quote} Ahh, I see where you're coming from: so without -core, you're suggesting that mahout pick up the jar files in the target directories if they exist? I think it is fine to modify the non-core classpath to include these; they won't be present in the release build anyway. {quote} The last step, as you've noted, is because I'm not sure that the script actually properly lets HADOOP_CONF_DIR properly get passed through the mahout shell script to actually running on the hadoop cluster, but maybe that's just a config issue in my case? Also means that in fact the default properties idea still doesn't work on hadoop, unless the default properties files are pushed to the classpath. {quote} Are any of the default properties files used beyond the MahoutDriver, which executes locally and sets up the job? Do these files need to be distributed to the rest of the cluster? As noted above, I think the proper way to run MahoutDriver in the context of a distributed job is to do something like: {code} ./bin/mahout org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier {code} I suspect we could easily modify the mahout script and shorten this to: {code} ./bin/mahout runjob TestClassifier {code} I can look at this a little closer tonight, so if you have an updated patch for me to work on/test in a few hours, definitely post it. 
I'd be happy to make any changes you're interested in. {quote} What is the right way to run a job with some additional (runtime) files added to the job's classpath? Is there some cmdline arg to hadoop that I'm forgetting? {quote} FWIW, [GenericOptionsParser|http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/GenericOptionsParser.html] provides a way to do this with -files, -libjars and -archives.
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837472#action_12837472 ] Jake Mannix commented on MAHOUT-301: {quote} Ahh, I see where you're coming from: so without -core, you're suggesting that mahout pick up the jar files in the target directories if they exist? I think it is fine to modify the non-core classpath to include these; they won't be present in the release build anyway. {quote} Cool, yeah, that makes sense. {quote} Are any of the default properties files used beyond the MahoutDriver, which executes locally and sets up the job? Do these files need to be distributed to the rest of the cluster? As noted above, I think the proper way to run MahoutDriver in the context of a distributed job is to do something like: {code} ./bin/mahout org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier {code} I suspect we could easily modify the mahout script and shorten this to: {code} ./bin/mahout runjob TestClassifier {code} {quote} Cool, so why not just check to see if $HADOOP_CONF_DIR is set - if it is, do runjob as described; if it's not, do run locally. {quote} FWIW, [GenericOptionsParser|http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/GenericOptionsParser.html] provides a way to do this with -files, -libjars and -archives {quote} Now of course, I guess I don't really need the files to get onto the job's classpath *on the cluster* - it just needs to be on the classpath of the locally running jvm which is invoking MahoutDriver.main(). So I was doing more work than was necessary. This is easy to do: just add MAHOUT_CONF_DIR to the classpath and we're good to go. 
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837477#action_12837477 ] Drew Farris commented on MAHOUT-301: bq. Cool, so why not just check to see if $HADOOP_CONF_DIR is set - if it is, do runjob as described, if it's not, do run to do locally. Yes, ok -- that should work because I believe you can use RunJar to launch anything even if it isn't a mapreduce job, no need for classpath setup in this case either -- all you need to do is point to the examples job. Might be able to take advantage of this elsewhere.
Re: [off-topic] Maven and SCP deploy.
barely. This article might help: http://unitstep.net/blog/2009/05/18/resolving-log4j-1215-dependency-problems-in-maven-using-exclusions/ On Tue, Feb 23, 2010 at 12:20 PM, Dawid Weiss dawid.we...@gmail.com wrote: do you know how/where to place such an override? -- Ted Dunning, CTO DeepDyve
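If the outdated jsch is coming in via the wagon-ssh build extension rather than via a project dependency, exclusions in the <dependencies> section won't reach it, since wagon providers run inside Maven itself. The route I'd try is pinning a newer wagon-ssh as a build extension so that it brings along its own newer jsch. A POM sketch only; the version number below is illustrative and unverified, and whether this takes effect may depend on the Maven version, per this thread:

```xml
<!-- Sketch: force a newer wagon-ssh (and, transitively, a newer jsch)
     by declaring it explicitly as a build extension. Adjust the version
     to whichever wagon-ssh release depends on jsch 0.1.42. -->
<build>
  <extensions>
    <extension>
      <groupId>org.apache.maven.wagon</groupId>
      <artifactId>wagon-ssh</artifactId>
      <version>1.0-beta-6</version>
    </extension>
  </extensions>
</build>
```

Running mvn -X ... | grep jsch afterwards, as suggested earlier in the thread, is the quickest way to confirm which jsch version actually ends up on Maven's classpath.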
[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837528#action_12837528 ] Ted Dunning commented on MAHOUT-305: {quote} Yeah in this context there's no choice but to count unrated items as misses. My intuition based on limited experience is it is in fact an issue - are the best items for a user typically found among their ratings in real-world data sets? I just can't imagine it's so for most users, who express few ratings. {quote} This suggests that mean reciprocal rank (MRR) of the top 5 or 10 highly rated items might be a useful measure. Even if the top 10 has several unrated good choices, if the rated choices are all pretty high then you can have pretty good feelings even if they didn't quite make the top 10.
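To make the two measures in this thread concrete, here is a small illustrative helper (hypothetical names, not part of Mahout): precision@k counts rated hits among the top-k recommendations, while reciprocal rank rewards a rated item near the top even when it misses the cutoff; averaging reciprocal rank over users gives MRR.

```java
import java.util.List;
import java.util.Set;

public class RankMetrics {
    /** Number of relevant (e.g. highly rated) items in the top k, divided by k. */
    static double precisionAtK(List<String> recommended, Set<String> relevant, int k) {
        int limit = Math.min(k, recommended.size());
        int hits = 0;
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(recommended.get(i))) {
                hits++;
            }
        }
        return k == 0 ? 0.0 : (double) hits / k;
    }

    /** Reciprocal rank of the first relevant item; 0 if none appears at all. */
    static double reciprocalRank(List<String> recommended, Set<String> relevant) {
        for (int i = 0; i < recommended.size(); i++) {
            if (relevant.contains(recommended.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        List<String> recs = List.of("A", "B", "C", "D");
        Set<String> liked = Set.of("B", "D");
        System.out.println(precisionAtK(recs, liked, 2)); // only B is in the top 2
        System.out.println(reciprocalRank(recs, liked));  // first hit is at rank 2
    }
}
```

Note that unrated items ("A" and "C" here) silently count as misses in both measures, which is exactly the caveat being debated above.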
[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-301: --- Attachment: MAHOUT-301.patch Ok, new patch. This one works in one of two ways. If you have $MAHOUT_CONF_DIR defined (there are some dummy files living in the newly created conf directory at the top level, moving away from core/src/main/resources), then you can just run: {code} $MAHOUT_HOME/bin/mahout run svd {code} and it should read your properties in $MAHOUT_CONF_DIR/svd.props and run (locally). The other way it can work (and actually does, at least on my setup) is running on hadoop: {code} $HADOOP_HOME/bin/hadoop jar path/to/mahout.job org.apache.mahout.driver.MahoutDriver svd {code} And again, $MAHOUT_CONF_DIR/svd.props is read locally before the job is launched off to the hadoop cluster. I have not yet been able to get the shell script to automagically issue RunJar as the command, passing MahoutDriver and the remaining args after it, so that you would never need to run hadoop's shell script at all, although that would be great to have working. Also not yet in this patch: actually defaulting MAHOUT_CONF_DIR to the correct place in both dev mode and release mode; and I haven't modified the pom to package up the new conf dir and put it in the distribution.
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837607#action_12837607 ] Drew Farris commented on MAHOUT-301: It doesn't appear that the following command works as intended: {code} ./bin/mahout org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier {code} The following seems to be the appropriate way to achieve what we're trying to do here: {code} hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier {code} Any thoughts on whether it makes sense to attempt to work the latter form into the mahout script? It won't pull the necessary config files for MahoutDriver in from a path outside of the job file unless HADOOP_CLASSPATH is set to include those directories, but I haven't had a chance to verify that. Improve command-line shell script by allowing default properties files -- Key: MAHOUT-301 URL: https://issues.apache.org/jira/browse/MAHOUT-301 Project: Mahout Issue Type: New Feature Components: Utils Affects Versions: 0.3 Reporter: Jake Mannix Assignee: Jake Mannix Priority: Minor Fix For: 0.4 Attachments: MAHOUT-301-drew.patch, MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch, MAHOUT-301.patch Snippet from javadoc gives the idea: {code} /** * General-purpose driver class for Mahout programs. Utilizes org.apache.hadoop.util.ProgramDriver to run * main methods of other classes, but first loads up default properties from a properties file. * * Usage: run on Hadoop like so: * * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \ * [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc] * * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed? 
* * (note: using the current shell script, this could be modified to be just * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options] * ) * * Works like this: by default, the file core/src/main/resources/driver.classes.prop is loaded, which * defines a mapping between short names like VectorDumper and fully qualified class names. This file may * instead be overridden on the command line by having the first argument be some string of the form *classes.props. * * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the * driver.classes.props file). After this, if the next argument ends in .props / .properties, it is taken to * be the file to use as the default properties file for this execution, and key-value pairs are built up from that: * if the file contains * * input=/path/to/my/input * output=/path/to/my/output * * Then the class which will be run will have its main called with * * main(new String[] { --input, /path/to/my/input, --output, /path/to/my/output }); * * After all the default properties are loaded from the file, any further command-line arguments are taken in, * and over-ride the defaults. */ {code} Could be cleaned up, as it's kinda ugly with the whole file named in .props, but gives the idea. Really helps cut down on repetitive long command lines, and lets defaults be put in props files instead of locked into the code. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
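The props-to-args mapping the javadoc describes can be sketched in a few lines of shell (the real driver does this in Java; the file name and keys here are illustrative):

```shell
#!/bin/sh
# Sketch only: shows how a default properties file maps onto the
# long-form command-line arguments MahoutDriver builds up.
cat > /tmp/kmeans.props <<'EOF'
input=/path/to/my/input
output=/path/to/my/output
EOF

ARGS=""
while IFS='=' read -r key value; do
  # each key=value pair becomes --key value
  ARGS="$ARGS --$key $value"
done < /tmp/kmeans.props

echo "$ARGS"
```

Any further command-line arguments would then be appended after `$ARGS`, overriding the defaults, exactly as the javadoc says.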
[jira] Commented: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12837616#action_12837616 ] Jake Mannix commented on MAHOUT-301: Our comments crossed in the ether! :) {quote} Any thoughts on whether it makes sense to attempt to work the latter form into the mahout script? It won't pull the necessary config files for MahoutDriver in from a path outside of the job file unless HADOOP_CLASSPATH is set to include those directories, but I haven't had a chance to verify that. {quote} You're right - I did indeed set my HADOOP_CLASSPATH to include $MAHOUT_CONF_DIR, which allowed this to work; otherwise it would not. This should be done by the script. Ideally, yes. It's ugly, but if $MAHOUT_HOME/bin/mahout just sets $HADOOP_CLASSPATH to include $MAHOUT_CONF_DIR (or $MAHOUT_HOME/conf if that variable is not set) and then executes $HADOOP_HOME/bin/hadoop jar ..., it should work. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
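A minimal sketch of the script change Jake suggests, assuming only the classpath-handling part (the MAHOUT_HOME default and the commented hadoop hand-off are illustrative):

```shell
#!/bin/sh
# Sketch of the bin/mahout fragment Jake describes: make the config
# directory visible to MahoutDriver by prepending it to HADOOP_CLASSPATH.
MAHOUT_HOME=${MAHOUT_HOME:-/opt/mahout}
MAHOUT_CONF_DIR=${MAHOUT_CONF_DIR:-$MAHOUT_HOME/conf}

# prepend the conf dir, preserving any classpath already set
export HADOOP_CLASSPATH="$MAHOUT_CONF_DIR${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
echo "$HADOOP_CLASSPATH"

# the real script would then hand off to hadoop, e.g.:
#   exec "$HADOOP_HOME/bin/hadoop" jar "$MAHOUT_JOB" \
#     org.apache.mahout.driver.MahoutDriver "$@"
```

With this in place, config files outside the job file are found without users having to set HADOOP_CLASSPATH themselves.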
[jira] Updated: (MAHOUT-301) Improve command-line shell script by allowing default properties files
[ https://issues.apache.org/jira/browse/MAHOUT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-301: --- Attachment: MAHOUT-301.patch Ok, now we're getting somewhere. This one a) can properly handle mahout run -h or mahout run --help, helpfully spitting out the list of classes with shortNames which MahoutDriver has been told about in driver.classes.props, and b) more importantly, it can, in both a release environment and a dev environment, do: {code} ./bin/mahout run kmeans [options] {code} If $MAHOUT_CONF_DIR is set and points to a place with the right files, then the default properties are loaded from there (overridden by the [options] given above). If both $HADOOP_HOME and $HADOOP_CONF_DIR are set, then this prepends $MAHOUT_CONF_DIR to $HADOOP_CLASSPATH so that the following is run: {code} $HADOOP_HOME/bin/hadoop jar [path to examples.job] o.a.m.driver.MahoutDriver kmeans [options] {code} which loads and overrides the default properties as necessary and runs your job on the hadoop cluster. If either of those variables is not specified (TODO: if $HADOOP_HOME is specified but $HADOOP_CONF_DIR is not, guess a default of $HADOOP_HOME/conf, I suppose), then the assumption is to run locally. Previous behavior still works, from what I can tell - you can still do: {code} $MAHOUT_HOME/bin/mahout kmeans --output kmeans/out --input input/vecs -k 13 --clusters tmp/foobar {code} and we're backwards compatible with the old way. Now the question is: do we want to be? Or do we want to trim down the shell script to always use MahoutDriver, get rid of all of the 'elif [ $COMMAND =' stuff, and just have $CLASS be MahoutDriver, passing it $COMMAND as the first argument? Then the command line would be exactly the same as before, except you could also load up your $MAHOUT_CONF_DIR/shortName.props files with whatever defaults you wanted to use.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
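The cluster-vs-local decision described in that update can be sketched as follows (pick_mode is a hypothetical helper for illustration; the real script inlines this logic and would also need to handle the $HADOOP_HOME-without-$HADOOP_CONF_DIR TODO):

```shell
#!/bin/sh
# Sketch of the dispatch logic in the patched bin/mahout: run on the
# cluster only when both HADOOP_HOME and HADOOP_CONF_DIR are set.
pick_mode() {
  hadoop_home=$1
  hadoop_conf=$2
  if [ -n "$hadoop_home" ] && [ -n "$hadoop_conf" ]; then
    echo cluster   # would exec: $hadoop_home/bin/hadoop jar $JOB MahoutDriver "$@"
  else
    echo local     # would run MahoutDriver in a plain local JVM instead
  fi
}

pick_mode /opt/hadoop /opt/hadoop/conf
pick_mode "" ""
```

Either way the same MahoutDriver entry point runs, which is what keeps the old command lines backwards compatible.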
Re: 0.3 release issues
+1 from me too Ted Dunning wrote: +1 to code freeze, waiting for hadoop release and testing the RC On Tue, Feb 23, 2010 at 8:38 AM, Isabel Drost isa...@apache.org wrote: On Tue Grant Ingersoll gsing...@apache.org wrote: On Feb 23, 2010, at 9:18 AM, Sean Owen wrote: It does look imminent. As much as I don't like holding out longer, and indefinitely, for this release, somehow I'd also really like to link to the latest/greatest and official Hadoop release. Let's try to be good about sticking to the code freeze -- good chance to focus on polish -- and if 0.20.2 isn't out by end of week, revisit this. +1. We might as well upgrade to the RC, too, by adding it as a dependency. +1 (to both proposals) Isabel
RE: Algorithm implementations in Pig
I too have mixed opinions w.r.t Pig. Pig would be a good choice for quick prototyping and testing. However, these are the pitfalls I have observed in Pig. It is not easy to debug in Pig. It also has performance issues, since it is a layer on top of Hadoop and there is overhead in translating Pig into map-reduce execution. Also, when the code is written directly in Hadoop, it is in the developer's/user's hands to improve performance by tuning various parameters, say, the number of mappers, different input formats, etc.; that is not the case with Pig. There are also compatibility issues between Pig and Hadoop: say, if I am using Pig version x on Hadoop version y, there might be incompatibilities, and time has to be spent resolving them, as the errors are not easy to figure out. I believe the main goal of Mahout is to provide scalable algorithms which can be used to solve real-world problems. In that case, if Pig gets rid of the above pitfalls, it would be a good choice, as it would greatly cut development effort. Thanks Pallavi -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Monday, February 22, 2010 11:32 PM To: mahout-dev@lucene.apache.org Subject: Re: Algorithm implementations in Pig As an interesting test case, can you write a pig program that counts words. BUT, it takes an input file name AND an input field name. On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com wrote: That isn't an issue here. It is the invocation of pig programs and passing useful information to them that is the problem. On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.comwrote: Scripting ability while still limited has better streaming support so you can have relations streamed Into a custom script executing in either map or reduce phase depending upon where it is placed. -- Ted Dunning, CTO DeepDyve -- Ted Dunning, CTO DeepDyve
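For reference, Ted's test case is easy to state outside Pig. A hypothetical plain shell/awk rendering (wordcount_field and the demo file are made up, and this is not Pig) just pins down the parameterization he is asking for: count the values in a given field of a given file.

```shell
#!/bin/sh
# Not Pig: a shell/awk stand-in for Ted's test case -- count the
# distinct values in field $2 of file $1.
wordcount_field() {
  awk -v f="$2" '{ counts[$f]++ } END { for (w in counts) print w, counts[w] }' "$1"
}

# tiny demo input: three rows, second field is the word of interest
printf 'the x\nthe x\na y\n' > /tmp/wc_demo.txt
wordcount_field /tmp/wc_demo.txt 2 | sort
```

The interesting part of Ted's challenge is not the counting but exactly this: passing the file name and field name in cleanly from the invoking environment.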
Re: Algorithm implementations in Pig
Pallavi, Thanks for your comments. Some clarifications w.r.t Pig. Pig does not generate any M/R code. What it generates is logical, physical, and map-reduce plans that are nothing but DAGs. The map-reduce plan is then interpreted by Pig's own mappers/reducers. The plan generation itself is done on the client side and takes a few seconds or minutes (if you have a really big script). About performance tuning in Hadoop: all the M/R parameters can be adjusted in Pig to have the same effect they'd have in Java M/R programs. Pig 0.7 is moving towards using Hadoop's input/output formats in its load/store functions, so your custom I/O formats can be easily reused with little additional effort. Pig also provides very nice features like MultiQuery optimization and skewed merge join that are hard to implement in Java M/R every time you need them. With the latest Pig release, 0.6, the performance gap between Java M/R and Pig has been narrowed to a good extent. Simple statistical measures that you would use to understand or preprocess your data are very easy to do with just a few lines of Pig code, and a lot of utility UDFs are available for that. Besides all the good things, I agree that there are compatibility issues running pig-x on hadoop-y, but this also has to do with new features of Hadoop that Pig is able to exploit in its pipeline. I also agree with the general opinion that for Pig's adoption in Mahout land it should play well with Mahout's vector formats. At the moment I don't have the free time to look into this, but I will surely get back to evaluating the feasibility of this integration in the coming few weeks. Till then, any interested folks can file a JIRA for this and work on it. On 2/24/10 12:27 PM, Palleti, Pallavi pallavi.pall...@corp.aol.com wrote: I too have mixed opinion w.r.t pig. Pig would be a good choice to quickly prototype and test. However, following are the pitfalls I have observed in pig. It is not easy to debug in pig.
Also, it have performance issues as it is a layer on top of hadoop, so the overhead of converting pig into map-reduce code. Also, when the code is available in hadoop, it is in developer/user's hand to improve the performance by using various parameters say, no of mappers, different input formats, etc. However is not the case with pig. Also,there are some compatibility issues with pig and hadoop. Say, if I am using pig-x version on hadoop-y version, there might be some compatibility issues and need to spend time on resolving the same as it is not easy to figure out the errors. I believe the main motto of mahout is to propose scalable algorithms which can be used to solve some real world problems. In such case, if pig has got rid of above pitfalls, then it would be good choice as we will have very less developing time efforts. Thanks Pallavi -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Monday, February 22, 2010 11:32 PM To: mahout-dev@lucene.apache.org Subject: Re: Algorithm implementations in Pig As an interesting test case, can you write a pig program that counts words. BUT, it takes an input file name AND an input field name. On Mon, Feb 22, 2010 at 9:56 AM, Ted Dunning ted.dunn...@gmail.com wrote: That isn't an issue here. It is the invocation of pig programs and passing useful information to them that is the problem. On Mon, Feb 22, 2010 at 9:20 AM, Ankur C. Goel gan...@yahoo-inc.comwrote: Scripting ability while still limited has better streaming support so you can have relations streamed Into a custom script executing in either map or reduce phase depending upon where it is placed. -- Ted Dunning, CTO DeepDyve -- Ted Dunning, CTO DeepDyve