Participating in GSOC 2009
Hi, I am planning to participate again in GSOC this year, and I want to do it again under Mahout. I am finishing up my Masters in CSE at IIT Kharagpur and will be joining Yahoo! this fall, where I will be seeing a lot of Hadoop :). I would like to thank all of you for all the help you have given.

This year I want to focus primarily on completing the classification module for Mahout. So far we have a M/R version of NB model building. I want to add the following features:

1) Batch classification of flat-file documents against a flat-file model
2) Storing the model in HBase
3) Quick classification of a single document using the HBase model
4) Batch classification using the HBase model
5*) Another option I have for quick model access is to load the key-value pairs onto Memcached servers (I use this currently in my Masters thesis work). I would like to know whether this is aligned with Mahout's objectives.

Apart from this, I am looking at the other patches to see where they stand, and I would like to attack another classification algorithm and implement the above features for it.

Warm Regards
Robin
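To make the "quick classification" idea concrete, a minimal Java sketch of Naive Bayes scoring against a key-value model might look like the following. This is illustrative only, not Robin's or Mahout's code: QuickClassifier, lookup(), and the key layout are hypothetical, and a real classifier would also need label priors and smoothing. The point is that the scoring loop stays the same whether lookup() is backed by a flat-file map, an HBase table, or memcached.

import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: score a tokenized document against a Bayes model whose
 *  per-(label, feature) log weights are fetched by key. The store behind
 *  lookup() could be an in-memory map loaded from a flat file, an HBase table,
 *  or memcached -- the scoring logic does not change. Hypothetical names. */
public class QuickClassifier {

  // In a real setting this map would be populated from the trained model
  // (flat file) or replaced by remote gets against HBase/memcached.
  private final Map<String, Double> weights = new HashMap<String, Double>();

  /** Stand-in for a flat-file load, an HBase Get, or a memcached fetch. */
  private double lookup(String label, String feature) {
    Double w = weights.get(label + "/" + feature);
    return w == null ? 0.0 : w;   // unseen features contribute nothing here
  }

  /** Pick the label with the highest summed log-likelihood weight. */
  public String classify(String[] tokens, String[] labels) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String label : labels) {
      double score = 0.0;
      for (String token : tokens) {
        score += lookup(label, token);
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}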
Re: [VOTE] Mahout 0.1
Here's my +1

On Mar 28, 2009, at 5:49 AM, Grant Ingersoll wrote:

[Take 2. I fixed the NOTICE file, but did not change the artifact generation issue for now.]

Please review and vote for releasing Mahout 0.1. This is our first release and is all new code. The artifacts are located at: http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/ The mahout directory contains a tarball/zip of the whole project (for building from source). The core, examples and taste-web directories contain the artifacts for each of those components. The other directories contain various dependencies and artifacts.

Thanks,
Grant
Re: [gsoc] random forests
Thank you for your answer; it made me aware of several potential future problems with my implementation.

> The first is that for any given application, the odds that the data will not fit in a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but are still a small minority of all datasets.

By "out-of-core" do you mean that the builder can fetch the data directly from a file instead of working only in memory?

> One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then it may well be that performance is still quite good for small datasets due to buffering.

I was planning to distribute the dataset files to all workers using Hadoop's DistributedCache. I think a streaming implementation is feasible: the basic tree-building algorithm (described here http://cwiki.apache.org/MAHOUT/random-forests.html) would have to stream through the data (either in memory or from a file) for each node of the tree. During this pass, it computes the information gain (IG) for the selected variables. This algorithm could be improved to compute the IGs for a whole list of nodes at once, thus reducing the total number of passes through the data [a rough sketch of such a pass appears after this message]. When building the forest, the list of nodes comes from all the trees built by the mapper.

> Another way to put this is that the key question is how single node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory size effect, then your approach (2) would be required for large data sets.

I'll have to run some tests before answering this question, but I think the memory usage of the improved algorithm (described above) will mainly be driven by the IG computations (variable probabilities, etc.). One way to limit the memory usage is to limit the number of tree nodes processed in each data pass: raising this limit reduces the number of passes but increases memory usage, and vice versa.

There is still one case that this approach, even out-of-core, cannot handle: very large datasets that cannot fit on the node's hard drive and thus must be distributed across the cluster.

abdelHakim

--- On Mon 3/30/09, Ted Dunning ted.dunn...@gmail.com wrote:

From: Ted Dunning ted.dunn...@gmail.com
Subject: Re: [gsoc] random forests
To: mahout-dev@lucene.apache.org
Date: Monday, March 30, 2009, 12:59 AM

I have two answers for you.

The first is that for any given application, the odds that the data will not fit in a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but are still a small minority of all datasets.

The second answer is that the odds that SOME mahout application will be too large for a single node are quite high. These aren't contradictory. They just describe the long-tail nature of problem sizes.

One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then it may well be that performance is still quite good for small datasets due to buffering.

If streaming works, then a single node will be able to handle very large datasets but will just be kind of slow. As you point out, that can be remedied trivially.

Another way to put this is that the key question is how single node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory size effect, then your approach (2) would be required for large data sets.

On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:

> My question is: when Mahout.RF will be used in a real application, what are the odds that the dataset will be so large that it can't fit on every machine of the cluster? The answer to this question should help me decide which implementation I'll choose.

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)
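Here is a minimal Java sketch of the multi-node pass described above, assuming categorical attributes and a partially built tree; all names (FrontierPass, Instance, routeToFrontier) are hypothetical, not Mahout APIs. One scan routes each instance to its frontier node and accumulates the (attribute, value, label) counts from which the IGs of every frontier node can then be computed without touching the data again.

import java.util.HashMap;
import java.util.Map;

/** Illustrative only -- not Mahout code. One streaming pass that collects the
 *  class counts needed to compute information gain (IG) for a whole frontier
 *  of unfinished tree nodes at once, so the data is scanned once per frontier
 *  instead of once per node. */
public class FrontierPass {

  /** A labeled instance with categorical attributes encoded as small ints. */
  public static class Instance {
    public final int[] attrs;
    public final int label;
    public Instance(int[] attrs, int label) {
      this.attrs = attrs;
      this.label = label;
    }
  }

  /** Counts keyed by (attribute, value, label), packed into one long. */
  public static class NodeStats {
    public final Map<Long, Integer> counts = new HashMap<Long, Integer>();
    void add(int attr, int value, int label) {
      long key = ((long) attr << 40) | ((long) value << 20) | label;
      Integer old = counts.get(key);
      counts.put(key, old == null ? 1 : old + 1);
    }
  }

  /** One pass over the data: route each instance down the partial tree to its
   *  frontier node and update that node's counts; the IG of every
   *  (node, attribute) pair can then be derived without another scan. */
  public static Map<Integer, NodeStats> gather(Iterable<Instance> data) {
    Map<Integer, NodeStats> stats = new HashMap<Integer, NodeStats>();
    for (Instance inst : data) {
      int node = routeToFrontier(inst);   // hypothetical tree walk
      if (node < 0) {
        continue;                         // landed in an already-finished leaf
      }
      NodeStats ns = stats.get(node);
      if (ns == null) {
        ns = new NodeStats();
        stats.put(node, ns);
      }
      for (int attr = 0; attr < inst.attrs.length; attr++) {
        ns.add(attr, inst.attrs[attr], inst.label);
      }
    }
    return stats;
  }

  /** Placeholder: a real builder would descend the splits built so far. */
  static int routeToFrontier(Instance inst) {
    return 0;
  }
}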
Re: [VOTE] Mahout 0.1
+1

On Mon, Mar 30, 2009 at 6:50 PM, Grant Ingersoll gsing...@apache.org wrote:

> Here's my +1
>
> On Mar 28, 2009, at 5:49 AM, Grant Ingersoll wrote:
>
> [Take 2. I fixed the NOTICE file, but did not change the artifact generation issue for now.]
>
> Please review and vote for releasing Mahout 0.1. This is our first release and is all new code. The artifacts are located at: http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/ The mahout directory contains a tarball/zip of the whole project (for building from source). The core, examples and taste-web directories contain the artifacts for each of those components. The other directories contain various dependencies and artifacts.
>
> Thanks,
> Grant
[jira] Created: (MAHOUT-112) Maven jetty plugin has been relocated
Maven jetty plugin has been relocated
-------------------------------------

Key: MAHOUT-112
URL: https://issues.apache.org/jira/browse/MAHOUT-112
Project: Mahout
Issue Type: Bug
Affects Versions: 0.1
Reporter: Jukka Zitting

Seen when building taste-web in the 0.1 release candidate and in the current Mahout trunk:

{noformat}
Downloading: http://repo1.maven.org/maven2/org/mortbay/jetty/maven-jetty-plugin/7.0.0.pre5/maven-jetty-plugin-7.0.0.pre5.jar
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] A required plugin was not found: Plugin could not be found - check that the goal name is correct: Unable to download the artifact from any repository
{noformat}

The Jetty plugin has been relocated from maven-jetty-plugin to jetty-maven-plugin. The following change solves the issue:

{noformat}
Index: taste-web/pom.xml
===================================================================
--- taste-web/pom.xml (Revision 760032)
+++ taste-web/pom.xml (Working copy)
@@ -82,7 +82,7 @@
       <plugin>
         <groupId>org.mortbay.jetty</groupId>
-        <artifactId>maven-jetty-plugin</artifactId>
+        <artifactId>jetty-maven-plugin</artifactId>
         <configuration>
           <webApp>${project.build.directory}/${project.artifactId}-${project.version}.war</webApp>
         </configuration>
{noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-113) CDInfosToolTest.testGatherInfos failure in Mahout examples
[ https://issues.apache.org/jira/browse/MAHOUT-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated MAHOUT-113:
---------------------------------
Attachment: org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest-output.txt

Attached test output.

CDInfosToolTest.testGatherInfos failure in Mahout examples
----------------------------------------------------------

Key: MAHOUT-113
URL: https://issues.apache.org/jira/browse/MAHOUT-113
Project: Mahout
Issue Type: Bug
Affects Versions: 0.1
Environment: Maven version: 2.0.9, Java version: 1.6.0_07, OS name: linux, version: 2.6.26.6-79.fc9.i686, arch: i386, Family: unix
Reporter: Jukka Zitting
Priority: Minor
Attachments: org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest-output.txt

I'm getting the following test failure when running mvn clean install on a fresh checkout of Mahout trunk:

{noformat}
-------------------------------------------------------------------------------
Test set: org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest
-------------------------------------------------------------------------------
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.828 sec <<< FAILURE!
testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest)  Time elapsed: 1.798 sec  <<< FAILURE!
junit.framework.AssertionFailedError: expected:<48> but was:<46>
	at junit.framework.Assert.fail(Assert.java:47)
	at junit.framework.Assert.failNotEquals(Assert.java:280)
	at junit.framework.Assert.assertEquals(Assert.java:64)
	at junit.framework.Assert.assertEquals(Assert.java:198)
	at junit.framework.Assert.assertEquals(Assert.java:204)
	at org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest.testGatherInfos(CDInfosToolTest.java:207)
{noformat}

I'll attach the test output file.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [gsoc] random forests
Indeed. And those datasets exist.

It is also plausible that this full-data-scan approach will fail when you want the forest building to take less time. It is also plausible that a full-data-scan approach fails to improve enough on a non-parallel implementation. This would happen if a significantly large fraction of the entire forest could be built on a single node, which in turn would happen if the CPU requirements of forest building are overshadowed by the I/O cost of scanning the dataset. That would imply a small limit on the amount of parallelism that can help [a toy calculation below illustrates this].

You will know much more about this after you finish the non-parallel implementation than either of us knows now.

On Mon, Mar 30, 2009 at 7:24 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:

> There is still one case that this approach, even out-of-core, cannot handle: very large datasets that cannot fit on the node's hard drive, and thus must be distributed across the cluster.

--
Ted Dunning, CTO
DeepDyve
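A toy calculation of that parallelism limit, with made-up numbers: if every worker must scan the whole dataset locally (approach 3), the scan time is a fixed floor, so speedup is capped at (scan + cpu) / scan no matter how many workers are added.

/** Back-of-envelope sketch with hypothetical numbers, not measurements. */
public class SpeedupBound {
  public static void main(String[] args) {
    double scanSeconds = 200;  // e.g. 10 GB at 50 MB/s per pass -- assumed
    double cpuSeconds = 400;   // CPU cost of split evaluation per pass -- assumed
    for (int workers : new int[] {1, 2, 4, 8, 16}) {
      // Every worker repeats the full scan; only the CPU work is divided.
      double time = scanSeconds + cpuSeconds / workers;
      System.out.printf("%2d workers: %.0f s (speedup %.2fx)%n",
          workers, time, (scanSeconds + cpuSeconds) / time);
    }
    // Speedup approaches (200 + 400) / 200 = 3x, however many workers join.
  }
}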
Re: [VOTE] Mahout 0.1
Hi,

On Sat, Mar 28, 2009 at 11:49 AM, Grant Ingersoll gsing...@apache.org wrote:

> Please review and vote for releasing Mahout 0.1.

-1

I'm getting conflicting checksums for the mahout-0.1-project.tar.gz package that I'm using to review the sources. The checksums of my download are:

MD5: f7341a0eb773d9f96ea87914f3c845dc
SHA1: 5c84eb607c07119aceb1f3824e38205474dabc08

But the .md5 and .sha1 files claim the following:

MD5: 46fd344aa6e88b7943e988dddaccd121
SHA1: 5b8bd6840718d2546b93656a3cc1528e7dfda75d

I'm still getting test errors on Windows, but now I also built the release candidate on my Linux desktop, where Mahout core passes all tests. Note however issues MAHOUT-112 and MAHOUT-113, which I encountered in the taste-web and examples modules. Other than those the release looks good; thanks for resolving some of the issues I raised earlier.

I'd be willing to vote +1 even without a new release candidate if the checksum issues can be tracked down and fixed. The PGP signatures seem to be correct.

BR,
Jukka Zitting
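For anyone reproducing the check, a small self-contained Java snippet along these lines computes both digests of a downloaded artifact for comparison against the published .md5/.sha1 files (this is only an illustration; any md5/sha1 command-line tool does the same job).

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

/** Print the MD5 and SHA-1 digests of the file given as the first argument,
 *  e.g. mahout-0.1-project.tar.gz, for comparison with the .md5/.sha1 files. */
public class ChecksumCheck {
  public static void main(String[] args) throws Exception {
    for (String algo : new String[] {"MD5", "SHA-1"}) {
      MessageDigest md = MessageDigest.getInstance(algo);
      InputStream in = new FileInputStream(args[0]);
      try {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
          md.update(buf, 0, n);
        }
      } finally {
        in.close();
      }
      // Render the digest as the usual lowercase hex string.
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest()) {
        hex.append(String.format("%02x", b));
      }
      System.out.println(algo + ": " + hex);
    }
  }
}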
Re: [VOTE] Mahout 0.1
On Mar 30, 2009, at 12:53 PM, Jukka Zitting wrote:

> Hi,
>
> On Sat, Mar 28, 2009 at 11:49 AM, Grant Ingersoll gsing...@apache.org wrote:
>
>> Please review and vote for releasing Mahout 0.1.
>
> -1
>
> I'm getting conflicting checksums for the mahout-0.1-project.tar.gz package that I'm using to review the sources. The checksums of my download are:
>
> MD5: f7341a0eb773d9f96ea87914f3c845dc
> SHA1: 5c84eb607c07119aceb1f3824e38205474dabc08
>
> But the .md5 and .sha1 files claim the following:
>
> MD5: 46fd344aa6e88b7943e988dddaccd121
> SHA1: 5b8bd6840718d2546b93656a3cc1528e7dfda75d

Hmm, isn't that auto-generated by Maven? I'll look into it. Thanks for checking, Jukka.
Re: [gsoc] random forests
I suggest that we all learn from the experience you are about to have with the reference implementation. And yes, I did mean the reference implementation when I said non-parallel. Thanks for clarifying.

On Mon, Mar 30, 2009 at 10:45 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:

> What do you suggest? And just to make sure, by 'non-parallel implementation' you mean the reference implementation, right?

--
Ted Dunning, CTO
DeepDyve