Participating in GSOC 2009

2009-03-30 Thread Robin Anil
Hi, I am planning to participate in GSOC again this year, and I want to
do it again under Mahout. I am finishing up my Masters in CSE at IIT
Kharagpur and will be joining Yahoo! this fall, where I will be seeing a
lot of Hadoop :). I would like to thank all of you for the help you have
given. This year I want to focus primarily on completing the
classification module for Mahout. So far we have a M/R version of Naive
Bayes (NB) model building. I want to add the following features:

1) Batch classification of flat-file documents using a flat-file model
2) Storing the model in HBase
3) Quick classification of a single document using the HBase model
4) Batch classification using the HBase model

5*) Another option for quick model access is to load the key-value
pairs into Memcached servers (I currently use this in my Masters thesis
work). I would like to know whether this is aligned with Mahout's
objectives; a rough sketch of the access pattern I have in mind follows
below.
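
To make this concrete, here is a minimal sketch of the lookup pattern I
have in mind, assuming a pluggable store behind the classifier;
ModelStore and QuickClassifier are hypothetical names, not existing
Mahout code. HBase, Memcached, or a flat file could all sit behind the
same interface:

// Hypothetical sketch, not existing Mahout code: the classifier only
// needs get-by-key access, so HBase, Memcached, or a flat file can all
// implement the same interface.
public interface ModelStore {
  /** Weight for (label, feature), or 0.0 if the pair is absent. */
  double getWeight(String label, String feature);
}

public final class QuickClassifier {
  private final ModelStore store;

  public QuickClassifier(ModelStore store) {
    this.store = store;
  }

  /** Returns the label whose summed log-weights over the document's terms is highest. */
  public String classify(Iterable<String> terms, Iterable<String> labels) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String label : labels) {
      double score = 0.0;
      for (String term : terms) {
        score += store.getWeight(label, term);   // one key-value lookup per term
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}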

Apart from this, I am looking at the other patches to see where they
stand, and would like to tackle another classification algorithm and
implement the above features for it as well.

Warm Regards
Robin


Re: [VOTE] Mahout 0.1

2009-03-30 Thread Grant Ingersoll

Here's my +1

On Mar 28, 2009, at 5:49 AM, Grant Ingersoll wrote:

[Take 2.  I fixed the NOTICE file, but did not change the artifact  
generation issue for now.]


Please review and vote for releasing Mahout 0.1.  This is our first  
release and is all new code.


The artifacts are located at:
http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/

The mahout directory contains a tarball/zip of the whole project  
(for building from source).
The core, examples, and taste-web directories contain the artifacts  
for each of those components.

The other directories contain various dependencies and artifacts.


Thanks,
Grant




Re: [gsoc] random forests

2009-03-30 Thread deneche abdelhakim

Thank you for your answer; it made me aware of several potential future 
problems with my implementation.

 The first is that for any given application, the odds that
 the data will not fit in a single machine are small, especially if you 
 have an out-of-core tree builder.  Really, really big datasets are
 increasingly common, but are still a small minority of all datasets.

By out-of-core, do you mean that the builder can fetch the data directly from 
a file instead of working only from memory?

 One question I have about your plan is whether your step (1) involves
 building trees or forests only from data held in memory or whether it 
 can be adapted to stream through the data (possibly several
 times).  If a streaming implementation is viable, then it may well be 
 that performance is still quite good for small datasets due to buffering.

I was planning to distribute the dataset files to all workers using Hadoop's 
DistributedCache. I think a streaming implementation is feasible: the basic 
tree-building algorithm (described here 
http://cwiki.apache.org/MAHOUT/random-forests.html) would have to stream 
through the data (either in memory or from a file) once for each node of the 
tree, computing the information gain (IG) for the selected variables during 
that pass. 
This algorithm could be improved to compute the IGs for a whole list of nodes 
at once, reducing the total number of passes through the data. When building 
the forest, that list of nodes would come from all the trees being built by 
the mapper; see the sketch below.
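
As a rough sketch of what I mean (Instance and TreeNode are hypothetical 
placeholders, not committed code), a single streaming pass could gather the 
class counts needed to compute the IGs of a whole frontier of open nodes:

// Hypothetical sketch: one streaming pass accumulates the statistics
// needed to compute information gain for many open tree-nodes at once,
// instead of one full pass per node. Instance and TreeNode are
// placeholders for whatever representation we end up with.
import java.util.List;

public class BatchedIgPass {

  public void countPass(Iterable<Instance> data, List<TreeNode> openNodes) {
    for (Instance instance : data) {      // data may be streamed from a file (out-of-core)
      for (TreeNode node : openNodes) {
        if (node.contains(instance)) {    // the instance reaches this node in its tree
          for (int attr : node.selectedAttributes()) {
            // a histogram of (attribute value, class label) counts per
            // node is all the IG computation needs later
            node.count(attr, instance.value(attr), instance.label());
          }
        }
      }
    }
    for (TreeNode node : openNodes) {
      node.computeInformationGains();     // done in memory from the counts
    }
  }
}

Capping the size of openNodes bounds the memory held by the counts, at the 
price of more passes over the data.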

 Another way to put this is that the key question is how single node
 computation scales with input size.  If the scaling is relatively linear
 with data size, then your approach (3) will work no matter the data size.
 If scaling shows an evil memory size effect, then your approach (2) 
 would be required for large data sets.

I'll have to run some tests before answering this question, but I think the 
memory usage of the improved algorithm (described above) will mostly go to 
storing the IG computations (variable probabilities...). One way to limit 
memory usage is to cap the number of tree-nodes processed in each data pass: 
raising the cap reduces the number of passes but increases memory usage, and 
vice versa.

There is still one case that this approach, even out-of-core, cannot handle: 
very large datasets that cannot fit on a node's hard drive, and thus must be 
distributed across the cluster.

abdelHakim
--- On Mon, 3/30/09, Ted Dunning ted.dunn...@gmail.com wrote:

 From: Ted Dunning ted.dunn...@gmail.com
 Subject: Re: [gsoc] random forests
 To: mahout-dev@lucene.apache.org
 Date: Monday, March 30, 2009, 12:59 AM
 I have two answers for you.

 The first is that for any given application, the odds that the data will not
 fit in a single machine are small, especially if you have an out-of-core
 tree builder.  Really, really big datasets are increasingly common, but are
 still a small minority of all datasets.

 The second answer is that the odds that SOME mahout application will be too
 large for a single node are quite high.

 These aren't contradictory.  They just describe the long-tail nature of
 problem sizes.

 One question I have about your plan is whether your step (1) involves
 building trees or forests only from data held in memory or whether it can be
 adapted to stream through the data (possibly several times).  If a streaming
 implementation is viable, then it may well be that performance is still
 quite good for small datasets due to buffering.

 If streaming works, then a single node will be able to handle very large
 datasets but will just be kind of slow.  As you point out, that can be
 remedied trivially.

 Another way to put this is that the key question is how single node
 computation scales with input size.  If the scaling is relatively linear
 with data size, then your approach (3) will work no matter the data size.
 If scaling shows an evil memory size effect, then your approach (2) would be
 required for large data sets.
 
 On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:
 
  My question is: when Mahout.RF will be used in a real application, what are
  the odds that the dataset will be so large that it can't fit on every
  machine of the cluster?

  The answer to this question should help me decide which implementation I'll
  choose.
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve
 
 111 West Evelyn Ave. Ste. 202
 Sunnyvale, CA 94086
 www.deepdyve.com
 408-773-0110 ext. 738
 858-414-0013 (m)
 408-773-0220 (fax)
 





Re: [VOTE] Mahout 0.1

2009-03-30 Thread Sean Owen
+1

On Mon, Mar 30, 2009 at 6:50 PM, Grant Ingersoll gsing...@apache.org wrote:
 Here's my +1

 On Mar 28, 2009, at 5:49 AM, Grant Ingersoll wrote:

 [Take 2.  I fixed the NOTICE file, but did not change the artifact
 generation issue for now.]

 Please review and vote for releasing Mahout 0.1.  This is our first
 release and is all new code.

 The artifacts are located at:
 http://people.apache.org/~gsingers/staging-repo/mahout/org/apache/mahout/

 The mahout directory contains a tarball/zip of the whole project (for
 building from source).
 The core, examples, and taste-web directories contain the artifacts for
 each of those components.
 The other directories contain various dependencies and artifacts.


 Thanks,
 Grant




[jira] Created: (MAHOUT-112) Maven jetty plugin has been relocated

2009-03-30 Thread Jukka Zitting (JIRA)
Maven jetty plugin has been relocated
-------------------------------------

 Key: MAHOUT-112
 URL: https://issues.apache.org/jira/browse/MAHOUT-112
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.1
Reporter: Jukka Zitting


Seen when building taste-web in the 0.1 release candidate and in the current 
Mahout trunk:

{noformat}
Downloading:
http://repo1.maven.org/maven2/org/mortbay/jetty/maven-jetty-plugin/7.0.0.pre5/maven-jetty-plugin-7.0.0.pre5.jar
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] A required plugin was not found: Plugin could not be found - check that
the goal name is correct:
Unable to download the artifact from any repository
{noformat}

The Jetty plugin has been relocated from maven-jetty-plugin to 
jetty-maven-plugin.

The following change solves the issue:

{noformat}
Index: taste-web/pom.xml
===================================================================
--- taste-web/pom.xml   (Revision 760032)
+++ taste-web/pom.xml   (working copy)
@@ -82,7 +82,7 @@
 
   <plugin>
 <groupId>org.mortbay.jetty</groupId>
-<artifactId>maven-jetty-plugin</artifactId>
+<artifactId>jetty-maven-plugin</artifactId>
 <configuration>
   <webApp>${project.build.directory}/${project.artifactId}-${project.version}.war</webApp>
 </configuration>
{noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-113) CDInfosToolTest.testGatherInfos failure in Mahout examples

2009-03-30 Thread Jukka Zitting (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jukka Zitting updated MAHOUT-113:
---------------------------------

Attachment: 
org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest-output.txt

Attached test output.

 CDInfosToolTest.testGatherInfos failure in Mahout examples
 -----------------------------------------------------------

 Key: MAHOUT-113
 URL: https://issues.apache.org/jira/browse/MAHOUT-113
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.1
 Environment: Maven version: 2.0.9
 Java version: 1.6.0_07
 OS name: linux version: 2.6.26.6-79.fc9.i686 arch: i386 Family: unix
Reporter: Jukka Zitting
Priority: Minor
 Attachments: 
 org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest-output.txt


 I'm getting the following test failure when running mvn clean install on a 
 fresh checkout of Mahout trunk:
 {noformat}
 -------------------------------------------------------------------------------
 Test set: org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest
 -------------------------------------------------------------------------------
 Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.828 sec <<< FAILURE!
 testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest)  Time elapsed: 1.798 sec  <<< FAILURE!
 junit.framework.AssertionFailedError: expected:<48> but was:<46>
 at junit.framework.Assert.fail(Assert.java:47)
 at junit.framework.Assert.failNotEquals(Assert.java:280)
 at junit.framework.Assert.assertEquals(Assert.java:64)
 at junit.framework.Assert.assertEquals(Assert.java:198)
 at junit.framework.Assert.assertEquals(Assert.java:204)
 at
 org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest.testGatherInfos(CDInfosToolTest.java:207)
 {noformat}
 I'll attach the test output file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [gsoc] random forests

2009-03-30 Thread Ted Dunning
Indeed.  And those datasets exist.

It is also plausible that this full-data-scan approach will break down when
you want the forest building to take less time.

It is also plausible that a full-data-scan approach fails to improve enough
on a non-parallel implementation.  This would happen if a significantly
large fraction of the entire forest could be built on a single node, which
in turn would happen if the CPU requirements of forest building are
overshadowed by the I/O cost of scanning the data set.  That would imply a
small limit on the amount of parallelism that would help.
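
To put illustrative numbers on that (made up, just to show the shape of the
limit): suppose one scan of the dataset costs T_io = 100 s of I/O and the CPU
work for the whole forest is T_cpu = 20 s.  A single machine takes
T_io + T_cpu = 120 s, but since every configuration still has to pay for the
full scan, no amount of parallelism pushes the job much below T_io = 100 s,
a speedup bound of (T_io + T_cpu) / T_io = 1.2.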

You will know much more about this after you finish the non-parallel
implementation than either of us knows now.

On Mon, Mar 30, 2009 at 7:24 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:

 There is still one case that this approach, even out-of-core, cannot
 handle: very large datasets that cannot fit in the node hard-drive, and thus
 must be distributed across the cluster.




-- 
Ted Dunning, CTO
DeepDyve


Re: [VOTE] Mahout 0.1

2009-03-30 Thread Jukka Zitting
Hi,

On Sat, Mar 28, 2009 at 11:49 AM, Grant Ingersoll gsing...@apache.org wrote:
 Please review and vote for releasing Mahout 0.1.

-1 I'm getting conflicting checksums for the mahout-0.1-project.tar.gz
package that I'm using to review the sources. The checksums of my
download are:

MD5: f7341a0eb773d9f96ea87914f3c845dc
SHA1: 5c84eb607c07119aceb1f3824e38205474dabc08

But the .md5 and .sha1 files claim the following:

MD5: 46fd344aa6e88b7943e988dddaccd121
SHA1: 5b8bd6840718d2546b93656a3cc1528e7dfda75d
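
For reference, this is roughly how I computed the digests on my side
(plain JDK, nothing Mahout-specific; the file to check is passed as the
first argument), in case someone wants to reproduce the numbers on
another platform:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

// Minimal sketch: computes the MD5 and SHA-1 digests of the file given
// as the first command-line argument, printing them as lowercase hex.
public class Digests {
  public static void main(String[] args) throws Exception {
    for (String algo : new String[] {"MD5", "SHA-1"}) {
      MessageDigest md = MessageDigest.getInstance(algo);
      InputStream in = new FileInputStream(args[0]);
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
      in.close();
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest()) {
        hex.append(String.format("%02x", b));
      }
      System.out.println(algo + ": " + hex);
    }
  }
}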

I'm still getting test errors on Windows, but now I also built the
release candidate on my Linux desktop where Mahout core passes all
tests. Note however the issues MAHOUT-112 and MAHOUT-113 which I
encountered in the taste-web and examples modules.

Other than those, the release looks good. Thanks for resolving some of
the issues I raised earlier. I'd be willing to vote +1 even without a
new release candidate if the checksum issues can be tracked down and
fixed. The PGP signatures seem to be correct.

BR,

Jukka Zitting


Re: [VOTE] Mahout 0.1

2009-03-30 Thread Grant Ingersoll


On Mar 30, 2009, at 12:53 PM, Jukka Zitting wrote:


Hi,

On Sat, Mar 28, 2009 at 11:49 AM, Grant Ingersoll  
gsing...@apache.org wrote:

Please review and vote for releasing Mahout 0.1.


-1 I'm getting conflicting checksums for the mahout-0.1-project.tar.gz
package that I'm using to review the sources. The checksums of my
download are:

   MD5: f7341a0eb773d9f96ea87914f3c845dc
   SHA1: 5c84eb607c07119aceb1f3824e38205474dabc08

But the .md5 and .sha1 files claim the following:

   MD5: 46fd344aa6e88b7943e988dddaccd121
   SHA1: 5b8bd6840718d2546b93656a3cc1528e7dfda75d


Hmm, isn't that auto-generated by Maven?  I'll look into it.

Thanks for checking, Jukka.


Re: [gsoc] random forests

2009-03-30 Thread Ted Dunning
I suggest that we all learn from the experience you are about to have with
the reference implementation.

And, yes, I did mean the reference implementation when I said
"non-parallel".  Thanks for clarifying.

On Mon, Mar 30, 2009 at 10:45 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:

 What do you suggest?

 And just to make sure: by 'non-parallel implementation' you mean the
 reference implementation, right?




-- 
Ted Dunning, CTO
DeepDyve