Speed up Frequent Compile

2010-02-05 Thread Robin Anil
When developing mahout core/util/examples we don't need to generate math often and don't need to tar, gzip, and bzip2 the jar files. We are mostly concerned with the job file/jar file. Can't there be another target, like develop, which does this? (Waiting 2-3 mins for a 2-line change is frustrating.) Robin

[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: Attachment: MAHOUT-237-tfidf.patch 4 Main Entry points DocumentProcessor - does SequenceFile => StringT

Re: Mahout 0.3 Plan and other changes

2010-02-05 Thread Robin Anil
I am committing the first level of changes so that Drew can work on it. I have updated the patch on the issue as a reference. Ted, please take a look when you get time. The names will change correspondingly. What I have right now is 4 main entry points: DocumentProcessor - does SequenceFile => StringTu

[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: Status: Patch Available (was: Reopened) Working Implementation DictionaryVectorizer using with tf, tfi

[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: Resolution: Fixed Status: Resolved (was: Patch Available) > Map/Reduce Implementation of Docum

[jira] Resolved: (MAHOUT-220) Mahout Bayes Code cleanup

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-220. Resolution: Fixed. Committed. > Mahout Bayes Code cleanup

[jira] Resolved: (MAHOUT-221) Implementation of FP-Bonsai Pruning for fast pattern mining

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-221. Resolution: Fixed. Committed. > Implementation of FP-Bonsai Pruning for fast pattern mining

[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830056#action_12830056 ] Robin Anil commented on MAHOUT-153: Any progress on this? Will it be ready soon or shoul

Re: Release thinking

2010-02-05 Thread Robin Anil
Reviving this thread. Copy and paste the whole thing as we move forward. Current snapshot (Key, Summary, Status): MAHOUT-221, Implementation of FP-Bonsai Pruning for fast pattern mining, Done; MAHOUT-227, Parallel SVM, In Progress; MAHOUT-240, Parallel version of Perceptron, Little Progr

[jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms

2010-02-05 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830077#action_12830077 ] Robin Anil commented on MAHOUT-185: I like the script as I am running k-means these days

Re: Proposing a C++ Port for Apache Mahout

2010-02-05 Thread Grant Ingersoll
One thought on these lines is that we should start the process to be a TLP, then we could have a subproject explicitly dedicated to C++ (or any other language) and there wouldn't necessarily need to be a 1-1 port. -Grant On Feb 5, 2010, at 12:56 AM, Kay Kay wrote: > If there were an effort to

Re: Release thinking

2010-02-05 Thread Ted Dunning
I just marked the 0.1 and 0.2 releases as released (about time). This makes the JIRA road map feature more usable. See here for the live version of this summary: https://issues.apache.org/jira/browse/MAHOUT?report=com.atlassian.jira.plugin.system.project:roadmap-panel On Fri, Feb 5, 2010 at 3:16

Re: [jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms

2010-02-05 Thread Ted Dunning
Surely there is a clever way to use annotations for this. Not that I know what it might be. On Fri, Feb 5, 2010 at 4:05 AM, Robin Anil (JIRA) wrote: > If we go like this we might have too many options. Any way to streamline > this ? > > One thought i have is to have package level Main classes i

Re: Release thinking

2010-02-05 Thread Robin Anil
Yum Yum. 0.1: 59 issues; 0.2: 66 issues; 0.3: 91 issues, 13 left. On Fri, Feb 5, 2010 at 9:47 PM, Ted Dunning wrote: > I just marked the 0.1 and 0.2 releases as released (about time). This > makes > the JIRA road map feature more usable. > > See here for the live version of this summary: > > ht

[jira] Created: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-05 Thread Drew Farris (JIRA)
Use avro for serialization of structured documents. Key: MAHOUT-274 URL: https://issues.apache.org/jira/browse/MAHOUT-274 Project: Mahout Issue Type: Improvement Reporter: Drew

[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-05 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-274: Attachment: mahout-avro-examples.tar.gz Very rudimentary exploration of using avro to produce writabl

Re: Release thinking

2010-02-05 Thread Drew Farris
On Fri, Feb 5, 2010 at 11:17 AM, Ted Dunning wrote: > I just marked the 0.1 and 0.2 releases as released (about time).  This makes > the JIRA road map feature more usable. > > See here for the live version of this summary: > https://issues.apache.org/jira/browse/MAHOUT?report=com.atlassian.jira.pl

Re: Speed up Frequent Compile

2010-02-05 Thread Drew Farris
On Fri, Feb 5, 2010 at 3:27 AM, Robin Anil wrote: > When developing mahout core/util/examples we dont need to generate math > often and dont need to tar gzip bzip2 the jar files. We are mostly concerned > with the job file/ jar file. > Cant there be another target like develop which does this. (wa

Re: Release thinking

2010-02-05 Thread Jake Mannix
So are we really planning on all this structured document stuff and Avro for 0.3? Can we just try to finish up what was already scoped for 0.3, and have a quick turnaround so that the things which only really started being worked on in the past week or so go into 0.4 sometime next month? -jake

Re: Speed up Frequent Compile

2010-02-05 Thread Ted Dunning
I usually do an initial compilation using mvn package. Then, during development I use IntelliJ's incremental compilation which generally only takes a few seconds. Since that compilation doesn't handle things like copying resources, I get caught out and surprised now and again, but this works almo

Re: Release thinking

2010-02-05 Thread Ted Dunning
Makes a lot of sense. Drew? On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix wrote: > So are we really planning on all this structured document stuff and Avro > for > 0.3? Can we just try and finish up what was already scoped for 0.3 and > have > a quick turnaround for getting things which have onl

Re: Release thinking

2010-02-05 Thread Jake Mannix
On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix wrote: > So are we really planning on all this structured document stuff and Avro > for 0.3? Can we just try and finish up what was already scoped for 0.3 and > have a quick turnaround for getting things which have only been really > started worked on

Re: Release thinking

2010-02-05 Thread Drew Farris
Sounds great to me. On Fri, Feb 5, 2010 at 11:50 AM, Ted Dunning wrote: > Makes a lot of sense.  Drew? > > On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix wrote: > >> So are we really planning on all this structured document stuff and Avro >> for >> 0.3?  Can we just try and finish up what was alrea

Re: Release thinking

2010-02-05 Thread Drew Farris
On Fri, Feb 5, 2010 at 11:53 AM, Jake Mannix wrote: > > Which is not to say that we shouldn't continue work on them, let's keep the > patches going and up to date, let's just not worry about holding up 0.3 > until they're fully tested and checked in. Yes absolutely. I'm also interested in hearin

Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
I use mvn install to generate the job; that takes around 2-3 mins because it also generates the bz2, zip, and gz archives. Otherwise I use mvn compile, which takes 33 sec (15 of those are spent compiling math). On Fri, Feb 5, 2010 at 10:18 PM, Drew Farris wrote: > On Fri, Feb 5, 2010 at 3:27 AM, Robin Anil wrote: > > When developing mahout core/util/examples

Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
Yes, for editing I use Eclipse in the same fashion. But if I want to try out a job and see how it performs on Hadoop, I need the job compiled fast. On another note, I think there will be a lot of dead code in the job (with all the jar files bundled). Is there an optimiser for that, i.e. to remove classes which

Re: Release thinking

2010-02-05 Thread Robin Anil
I just updated it here: http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html Let's rename/refactor the classes and get the basic Avro thing in for 0.3, so that people who use it get a smooth upgrade to 0.4. Robin On Fri, Feb 5, 2010 at 10:32 PM, Drew Farris wrote: > On Fri, Feb 5, 2010 at 1

[jira] Updated: (MAHOUT-272) Add licenses for 3rd party jars to mahout binary release and remove additional unused dependencies.

2010-02-05 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-272: Resolution: Fixed Assignee: Drew Farris Status: Resolved (was: Patch Available) > Add lice

Re: Speed up Frequent Compile

2010-02-05 Thread Drew Farris
So, I'm running: mvn -o install -DskipTests=true at project root (in mahout). Comment out or remove the maven-assembly-plugin definition in core/pom.xml -- it reduced my core build time from 26s to 6s -- I can submit a patch for this. Mahout math is still 17s here due to code generation. I'm wonde

Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
Thanks! 25 seconds is a winner. It can go down to 15 if recompilation of parent, math and mojo is turned off. On Fri, Feb 5, 2010 at 10:49 PM, Drew Farris wrote: > So, I'm running: mvn -o install -DskipTests=true at project root (in > mahout) > > Comment out or remove the maven-assembly-plug

Re: Proposing a C++ Port for Apache Mahout

2010-02-05 Thread Israel Ekpo
Thanks everyone for your responses so far. The Apache Hadoop dependency was something I thought about initially but I still went ahead to ask the question anyways. At this time, it would be a better use of resources and time to come up with a wrapper or HTTP server/client set up of some sort. My

Re: Speed up Frequent Compile

2010-02-05 Thread Benson Margulies
Yes, the codegen could drop a timestamp file. It's a fair amount of work, and if we're killing this code for HPCC I'm dubious. If I could make the split work I could do this next. On Fri, Feb 5, 2010 at 12:19 PM, Drew Farris wrote: > So, I'm running: mvn -o install -DskipTests=true at project r
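
The timestamp-file idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Mahout code generator: the function names, the template directory, and the stamp file location are all hypothetical, and the actual mojo would be written in Java. The check reruns code generation only when no stamp exists yet or when some template is newer than the stamp.

```python
from pathlib import Path


def needs_regeneration(template_dir: Path, stamp_file: Path) -> bool:
    """Return True if the generated sources may be stale.

    Stale means: there is no stamp file yet, or at least one file
    under template_dir was modified after the stamp was last written.
    (Hypothetical helper; names and layout are assumptions.)
    """
    if not stamp_file.exists():
        return True
    stamp_mtime = stamp_file.stat().st_mtime
    for path in template_dir.rglob("*"):
        if path.is_file() and path.stat().st_mtime > stamp_mtime:
            return True
    return False


def mark_generated(stamp_file: Path) -> None:
    """Create or refresh the stamp file after a successful codegen run."""
    stamp_file.parent.mkdir(parents=True, exist_ok=True)
    stamp_file.touch()
```

A build step would call needs_regeneration first, run the generator only when it returns True, and then call mark_generated. Comparing modification times keeps the check cheap, at the cost of missing changes that do not touch a template file (for example, a change to the generator itself).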

Re: Proposing a C++ Port for Apache Mahout

2010-02-05 Thread Israel Ekpo
Grant, Would the TLP be Mahout or under a different name? I also like the idea that it does not necessarily have to be a 1:1 port. Kay Kay, I change my mind (going the wrapper route), I think it would be nice to explore the possibilities with just a subset of the algorithms. That would be a go

Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
It's just meant to be a dev-only hack :) On Sat, Feb 6, 2010 at 3:09 AM, Benson Margulies wrote: > Yes, the codegen could drop a timestamp file. It's a fair amount of > work, and if we're killing this code for HPCC I'm dubious. > > If I could make the split work I could do this next. > > > On Fri

Re: Speed up Frequent Compile

2010-02-05 Thread Benson Margulies
Then we could make a profile that turns off the code gen and turns on the build helper to add the generated source dir instead. On Fri, Feb 5, 2010 at 4:49 PM, Robin Anil wrote: > Its just meant to be a dev only hack :) > > > On Sat, Feb 6, 2010 at 3:09 AM, Benson Margulies wrote: > >> Yes, the c

Re: [Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-05 Thread Jeff Eastman
Jeff Eastman wrote: Jeff Eastman wrote: Ted Dunning wrote: This could also be caused if the prior is very diffuse. This makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha. Could you elaborate more on

Re: [Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-05 Thread Jeff Eastman
Jeff Eastman wrote: Jeff Eastman wrote: Jeff Eastman wrote: Ted Dunning wrote: This could also be caused if the prior is very diffuse. This makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha. Could