Re: 0.2
I suppose I have volunteered for the release. What does it entail, making the release? I don't have any knowledge of this. ... or MAHOUT-114 or what it means to sign these jars? If info is available I can try to figure these out. On Thu, Oct 15, 2009 at 10:19 AM, Grant Ingersoll wrote: > OK. The Sparse vector improvements we have now are already a lot faster > than what was in 0.1, so that is good. I'd suggest that whoever is the > Release Mgr. for this release takes care of the signing stuff. I'll look at > the Label (LLR) stuff by Monday. >
Re: LDA for multi label classification was: Mahout Book
Sorry, this slipped out of my inbox and I just found it! On Thu, Oct 8, 2009 at 12:05 PM, Robin Anil wrote: > Posting to the dev list. > Great paper, thanks! Looks like L-LDA could be used to create some > interesting examples. > The paper shows L-LDA could be used to create a word-tag model for accurate > tag(s) prediction given a document of words. I will finish reading and > report > how much work is needed to transform/build on top of the current LDA > implementation to get L-LDA. Any thoughts? Umm, cool! In the paper we used Gibbs sampling to do the inference, and the implementation in Mahout uses variational inference (because it distributes better). I don't see any obvious problems in terms of math, and so the rest is just fitting it into the system. I think a small amount of refactoring would be in order to make things more generic, and then it shouldn't be too hard to plug in. I'll add it to my list, but I'm swamped for quite some time. -- David > Robin > On Thu, Oct 8, 2009 at 11:50 PM, David Hall wrote: >> >> The short answer is that it probably won't help all that much. Naive >> Bayes is unreasonably good when you have enough data. >> >> The long answer is, I have a paper with Dan Ramage and Ramesh >> Nallapati that talks about how to do it. >> >> www.aclweb.org/anthology-new/D/D09/D09-1026.pdf >> >> In some sense, "Labeled-LDA" is a kind of Naive Bayes where you can >> have more than one class per document. If you have exactly one class >> per document, then LDA reduces to Naive Bayes (or the unsupervised >> variant of Naive Bayes, which is basically k-means in multinomial >> space). If instead you wanted to project W words to K topics, with K > >> numWords, then there is something to do... >> >> That something is something like: >> >> 1) get p(topic|word,document) for each word in each document (which is >> output by LDAInference). Those are your expected counts for each >> topic. 
>> >> 2) For each class, do something like: >> p(topic|class) \propto \sum_{document with that class,word} >> p(topic|word,document) >> >> Then just apply Bayes' rule to do classification: >> >> p(class|topics,document) \propto p(class) \prod p(topic|class,document) >> >> -- David >> >> On Thu, Oct 8, 2009 at 11:07 AM, Robin Anil wrote: >> > Thanks. Didn't see that; fixed it! >> > I have a query: >> > How is the LDA topic model used to improve a classifier, say Naive >> > Bayes? If >> > it's possible, then I would like to integrate it into Mahout. >> > Given m classes and the associated documents, one can build m topic >> > models, >> > right? (a set of topics (words) under each label and the associated >> > probability >> > distribution of words). >> > How can I use that info to weight the most relevant topic of a class? >> > >> > >> >> >> LDA has two meanings: linear discriminant analysis and latent >> >> Dirichlet allocation. My code is the latter. The former is a kind of >> >> classification. You say linear discriminant analysis in the outline. >> >> >> > >
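David's two steps can be sketched in plain Java. This is a toy illustration only: the array layout and class names are hypothetical, and it does not reproduce Mahout's actual LDAInference API — it just assumes the per-document expected topic counts from step (1) have already been computed.

```java
// Toy sketch of the recipe above: turn LDA's per-document expected topic
// counts into a Naive-Bayes-style classifier. Names/shapes are hypothetical.
public class LdaNaiveBayes {

  // expectedCounts[d][k] = sum over words in doc d of p(topic k | word, doc d),
  // i.e. the expected topic counts that step (1) describes.
  public static double[][] trainTopicGivenClass(double[][] expectedCounts,
                                                int[] docClass, int numClasses) {
    int numTopics = expectedCounts[0].length;
    double[][] pTopicGivenClass = new double[numClasses][numTopics];
    // step (2): p(topic|class) \propto sum over documents with that class
    for (int d = 0; d < expectedCounts.length; d++) {
      for (int k = 0; k < numTopics; k++) {
        pTopicGivenClass[docClass[d]][k] += expectedCounts[d][k];
      }
    }
    for (double[] row : pTopicGivenClass) {   // normalize each class's topics
      double sum = 0;
      for (double v : row) sum += v;
      for (int k = 0; k < row.length; k++) row[k] /= sum;
    }
    return pTopicGivenClass;
  }

  // Bayes rule in log space:
  // argmax_c [ log p(c) + sum_k count_k * log p(topic k | c) ]
  public static int classify(double[] docTopicCounts, double[] logPrior,
                             double[][] pTopicGivenClass) {
    int best = -1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int c = 0; c < pTopicGivenClass.length; c++) {
      double score = logPrior[c];
      for (int k = 0; k < docTopicCounts.length; k++) {
        score += docTopicCounts[k] * Math.log(pTopicGivenClass[c][k]);
      }
      if (score > bestScore) { bestScore = score; best = c; }
    }
    return best;
  }

  public static void main(String[] args) {
    // 4 toy docs, 2 topics, 2 classes
    double[][] counts = {{3, 1}, {4, 0}, {0, 5}, {1, 3}};
    int[] labels = {0, 0, 1, 1};
    double[][] model = trainTopicGivenClass(counts, labels, 2);
    double[] logPrior = {Math.log(0.5), Math.log(0.5)};
    System.out.println(classify(new double[]{5, 1}, logPrior, model)); // prints 0
    System.out.println(classify(new double[]{0, 4}, logPrior, model)); // prints 1
  }
}
```

A real implementation would also smooth p(topic|class) to avoid log(0) for topics a class never generated; the toy data here happens to avoid that.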
[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains
[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766231#action_12766231 ] Grant Ingersoll commented on MAHOUT-165: Shashi's vectors are at: http://people.apache.org/~gsingers/mahout/vectors-test-mahout.gz. > Using better primitives hash for sparse vector for performance gains > > > Key: MAHOUT-165 > URL: https://issues.apache.org/jira/browse/MAHOUT-165 > Project: Mahout > Issue Type: Improvement > Components: Matrix >Affects Versions: 0.2 >Reporter: Shashikant Kore >Assignee: Grant Ingersoll > Fix For: 0.2 > > Attachments: colt.jar, mahout-165-trove.patch, > MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch > > > In SparseVector, we need a primitives hash map for indices and values. The > present implementation of this hash map is not as efficient as some of the > other implementations in non-Apache projects. > In an experiment, I found that, for get/set operations, the primitive hash of > Colt performs an order of magnitude better than OrderedIntDoubleMapping. > For iteration it is 2x slower, though. > Using Colt in SparseVector improved performance of canopy generation. For an > experimental dataset, the current implementation takes 50 minutes. Using > Colt reduces this duration to 19-20 minutes. That's a 60% reduction in > running time. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
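To illustrate why a primitives map helps here, below is a minimal open-addressed int-to-double hash map in the spirit of what the issue describes. This is a from-scratch sketch, not Colt's actual OpenIntDoubleHashMap (which is what the patch uses); it assumes non-negative keys (vector indices) and omits removal.

```java
// Minimal open-addressed int->double map, in the spirit of a primitives hash
// for SparseVector. Sketch only: assumes non-negative keys, no removal.
public class IntDoubleOpenHashMap {
  private static final int FREE = -1;  // empty-slot sentinel; keys must be >= 0
  private int[] keys;
  private double[] values;
  private int size;

  public IntDoubleOpenHashMap(int capacity) {
    int cap = Integer.highestOneBit(Math.max(4, capacity) * 2); // power of two
    keys = new int[cap];
    values = new double[cap];
    java.util.Arrays.fill(keys, FREE);
  }

  private int slot(int key) {
    int mask = keys.length - 1;
    int i = (key * 0x9E3779B9) & mask;     // cheap multiplicative hash
    while (keys[i] != FREE && keys[i] != key) {
      i = (i + 1) & mask;                  // linear probing
    }
    return i;
  }

  public void put(int key, double value) {
    if (size * 2 >= keys.length) grow();   // keep load factor <= 0.5
    int i = slot(key);
    if (keys[i] == FREE) { keys[i] = key; size++; }
    values[i] = value;
  }

  public double get(int key) {             // absent keys read as 0.0,
    int i = slot(key);                     // matching sparse-vector semantics
    return keys[i] == key ? values[i] : 0.0;
  }

  private void grow() {
    int[] oldKeys = keys;
    double[] oldValues = values;
    keys = new int[oldKeys.length * 2];
    values = new double[keys.length];
    java.util.Arrays.fill(keys, FREE);
    size = 0;
    for (int i = 0; i < oldKeys.length; i++) {
      if (oldKeys[i] != FREE) put(oldKeys[i], oldValues[i]);
    }
  }

  public static void main(String[] args) {
    IntDoubleOpenHashMap v = new IntDoubleOpenHashMap(8);
    v.put(42, 3.14);
    System.out.println(v.get(42) + " " + v.get(7)); // prints 3.14 0.0
  }
}
```

The win over a boxed java.util.HashMap&lt;Integer, Double&gt; is that keys and values live in flat primitive arrays: get/set is a hash, a probe, and an array read, with no Integer/Double boxing and no Entry objects to allocate or chase, which is consistent with the get/set speedup reported above.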
Re: 0.2
OK. The Sparse vector improvements we have now are already a lot faster than what was in 0.1, so that is good. I'd suggest that whoever is the Release Mgr. for this release takes care of the signing stuff. I'll look at the Label (LLR) stuff by Monday. On Oct 15, 2009, at 1:02 PM, Jeff Eastman wrote: I'd vote to delay 165 for 0.3 but do it in trunk asap after 0.2 so folks can get their hands on it. Sean Owen wrote: It still sounds somewhat significant to me. Either it's rushed or takes a while and both seem negative. +1 This is why I think it is vital, at least, to put a schedule on this, or else we are basically saying 0.2 is to not be released indefinitely, and that's no good. Last time we said we'd finish up and release this was 2 weeks ago, and there hasn't been progress on this issue. I'm starting to feel strongly enough to call for a vote? On Thu, Oct 15, 2009 at 6:47 AM, Grant Ingersoll wrote: I don't think it is that big. We can likely just make another implementation of Vector. We don't have to convert everything to Colt.
Re: 0.2
I'd vote to delay 165 for 0.3 but do it in trunk asap after 0.2 so folks can get their hands on it. Sean Owen wrote: It still sounds somewhat significant to me. Either it's rushed or takes a while and both seem negative. +1 This is why I think it is vital, at least, to put a schedule on this, or else we are basically saying 0.2 is to not be released indefinitely, and that's no good. Last time we said we'd finish up and release this was 2 weeks ago, and there hasn't been progress on this issue. I'm starting to feel strongly enough to call for a vote? On Thu, Oct 15, 2009 at 6:47 AM, Grant Ingersoll wrote: I don't think it is that big. We can likely just make another implementation of Vector. We don't have to convert everything to Colt.
Re: 0.2
On Thu, Oct 15, 2009 at 6:47 AM, Grant Ingersoll wrote: > > On Oct 15, 2009, at 8:22 AM, Sean Owen wrote: > > On Thu, Oct 15, 2009 at 4:57 AM, Grant Ingersoll >> wrote: >> >>> MAHOUT-165 Using better primitives hash for sparse vector for performance gains Open 14/Oct/09 Per discussion, move the remainder (migration to Colt or something) to 0.3 >>> >>> I will try to get to this, as I think it is important. >>> >> >> I agree with Jeff that the migration to a new framework is a big >> change and should be left to 0.3. (Vote?) There is a whole lot of >> change already, more than might normally go into a point release. >> Since you have another blocker below, and limited time, I say don't >> kill yourself to work on this. It's going to be hard to get it done in >> a weekend. >> >> > > I don't think it is that big. We can likely just make another > implementation of Vector. We don't have to convert everything to Colt. > Ted's patch (since monkeyed with by you and myself) has the other implementation of Vector, but testing showed it's slower? This patch also had a significant refactoring of the Vector hierarchy, so it's not just "a new class". I'm all for getting this in as soon as we can, because this issue (well, finalizing on a linear algebra API) pretty much blocks my donating decomposer to Mahout, but it looks like you're the only one who feels strongly about resolving M-165 for 0.2, Grant. Can we not just have 0.3 in another 6-8 weeks or so which covers this? What Mahout user is getting blocked by having too-slow sparse vectors currently? -jake
[jira] Resolved: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing
[ https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost resolved MAHOUT-138. - Resolution: Fixed Fix Version/s: (was: 0.3) 0.2 The last commit changed the remaining classes - so at least grep does not find any usages of 'args\[' anywhere in our source code. > Convert main() methods to use Commons CLI for argument processing > - > > Key: MAHOUT-138 > URL: https://issues.apache.org/jira/browse/MAHOUT-138 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.2 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 0.2 > > Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch > > > Commons CLI is in the classpath and makes it much easier to handle command > line args, and they are more self-documenting when done right. We should > convert our main methods to use CLI -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: 0.2
It still sounds somewhat significant to me. Either it's rushed or takes a while and both seem negative. I think it is vital, at least, to put a schedule on this, or else we are basically saying 0.2 is to not be released indefinitely, and that's no good. Last time we said we'd finish up and release this was 2 weeks ago, and there hasn't been progress on this issue. I'm starting to feel strongly enough to call for a vote? On Thu, Oct 15, 2009 at 6:47 AM, Grant Ingersoll wrote: > I don't think it is that big. We can likely just make another > implementation of Vector. We don't have to convert everything to Colt.
Re: 0.2
On Oct 15, 2009, at 8:22 AM, Sean Owen wrote: On Thu, Oct 15, 2009 at 4:57 AM, Grant Ingersoll wrote: MAHOUT-165 Using better primitives hash for sparse vector for performance gains Open 14/Oct/09 Per discussion, move the remainder (migration to Colt or something) to 0.3 I will try to get to this, as I think it is important. I agree with Jeff that the migration to a new framework is a big change and should be left to 0.3. (Vote?) There is a whole lot of change already, more than might normally go into a point release. Since you have another blocker below, and limited time, I say don't kill yourself to work on this. It's going to be hard to get it done in a weekend. I don't think it is that big. We can likely just make another implementation of Vector. We don't have to convert everything to Colt. MAHOUT-114 Release Process Needs to sign published dependencies such as Hadoop, etc. Open 06/Apr/09 Not clear on status here, mark as 0.3? This is a blocker for 0.2 and thus must be completed. That being said, I think Hadoop is now publishing to the Maven repo, so we may be able to stop our own publishing of Hadoop.
Re: 0.2
On Thu, Oct 15, 2009 at 4:57 AM, Grant Ingersoll wrote: >> MAHOUT-165 Using better primitives hash for sparse vector for >> performance gains Open 14/Oct/09 >> >> Per discussion, move the remainder (migration to Colt or something) to 0.3 > > I will try to get to this, as I think it is important. I agree with Jeff that the migration to a new framework is a big change and should be left to 0.3. (Vote?) There is a whole lot of change already, more than might normally go into a point release. Since you have another blocker below, and limited time, I say don't kill yourself to work on this. It's going to be hard to get it done in a weekend. >> MAHOUT-114 Release Process Needs to sign published >> dependencies such >> as Hadoop, etc. Open 06/Apr/09 >> >> Not clear on status here, mark as 0.3? > > This is a blocker for 0.2 and thus must be completed. That being said, I > think Hadoop is now publishing to the Maven repo, so we may be able to stop > our own publishing of Hadoop. > >
Re: 0.2
On Oct 15, 2009, at 7:21 AM, Sean Owen wrote: Here's what is marked 0.2 plus suggested actions. I am basically suggesting the things that are 'pretty ready' be submitted and published -- if they're 85% done, definitely good enough for an 0.2 release, and worth getting them play-tested. (Or else, decide they need another month or two, and mark for 0.3) And then that takes care of just about everything for 0.2 MAHOUT-163 Get (better) cluster labels using Log Likelihood Ratio Open 17/Sep/09 No recent action here, but seemed ready enough to submit as of last patch. Do so or mark 0.3? I will make sure to get this one in before the release. MAHOUT-171 Move deployment to repository.apache.org Open 02/Oct/09 Seems ready to submit? +1. It would be great to have Maven snapshots available for nightly builds. MAHOUT-185 Add mahout shell script for easy launching of various algorithms Open 06/Oct/09 Very new, sounds like something for 0.3 This is mostly for convenience. Would be nice to have in 0.2, but not a show stopper. MAHOUT-170 Enable Java compile optimize flag during build Open 07/Oct/09 Go ahead and submit? The original change seemed quite uncontroversial. Robin suggested a further change. Either submit or mark 0.3 MAHOUT-186 Classifier PriorityQueue returns erroneous results Patch Available 08/Oct/09 Two patches available. I would like my patch for this issue to get some feedback -- would prefer it be submitted or some even better hybrid of it and the first patch. MAHOUT-148 Convert Classification Algs to use richer Writable syntax Patch Available 10/Oct/09 Ready to submit? MAHOUT-157 Frequent Pattern Mining using Parallel FP-Growth Patch Available 13/Oct/09 Seems like still work in progress. If it's 'good enough', submit and continue iterating. 
Or mark 0.3 MAHOUT-165 Using better primitives hash for sparse vector for performance gains Open 14/Oct/09 Per discussion, move the remainder (migration to Colt or something) to 0.3 I will try to get to this, as I think it is important. MAHOUT-106 PLSI/EM in pig based on Hofmann's ACM 04 paper. Patch Available 27/Aug/09 This looks like something better tagged as 'unknown version'; I don't understand the status I had hoped to do this, but let's move it to 0.3 MAHOUT-114 Release Process Needs to sign published dependencies such as Hadoop, etc. Open 06/Apr/09 Not clear on status here, mark as 0.3? This is a blocker for 0.2 and thus must be completed. That being said, I think Hadoop is now publishing to the Maven repo, so we may be able to stop our own publishing of Hadoop.
[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth
[ https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766030#action_12766030 ] Isabel Drost commented on MAHOUT-157: - The patch looks good to me. Good work Robin. > Frequent Pattern Mining using Parallel FP-Growth > > > Key: MAHOUT-157 > URL: https://issues.apache.org/jira/browse/MAHOUT-157 > Project: Mahout > Issue Type: New Feature > Components: Frequent Itemset/Association Rule Mining >Affects Versions: 0.2 >Reporter: Robin Anil >Assignee: Robin Anil > Fix For: 0.2 > > Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, > MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, > MAHOUT-157-codecleanup-javadocs.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-Combinations-BSD-License.patch, > MAHOUT-157-CompactTransactionMapperFormat.patch, MAHOUT-157-final.patch, > MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-Oct-1.patch, > MAHOUT-157-Oct-10.pfpgrowth.patch, MAHOUT-157-Oct-8.pfpgrowth.patch, > MAHOUT-157-Oct-8.TestedMapReducePipeline.patch, > MAHOUT-157-Oct-9.StreamingDBRead-Inprogress.patch, > MAHOUT-157-September-10.patch, MAHOUT-157-September-18.patch, > MAHOUT-157-September-5.patch > > > Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: 0.2
Here's what is marked 0.2 plus suggested actions. I am basically suggesting the things that are 'pretty ready' be submitted and published -- if they're 85% done, definitely good enough for an 0.2 release, and worth getting them play-tested. (Or else, decide they need another month or two, and mark for 0.3) And then that takes care of just about everything for 0.2 MAHOUT-163 Get (better) cluster labels using Log Likelihood Ratio Open 17/Sep/09 No recent action here, but seemed ready enough to submit as of last patch. Do so or mark 0.3? MAHOUT-171 Move deployment to repository.apache.org Open 02/Oct/09 Seems ready to submit? MAHOUT-185 Add mahout shell script for easy launching of various algorithms Open 06/Oct/09 Very new, sounds like something for 0.3 MAHOUT-170 Enable Java compile optimize flag during build Open 07/Oct/09 Go ahead and submit? the original change seemed quite uncontroversial. Robin suggested a further change. Either submit or mark 0.3 MAHOUT-186 Classifier PriorityQueue returns erroneous results Patch Available 08/Oct/09 Two patches available. I would like my patch for this issue to get some feedback -- would prefer it be submitted or some even better hybrid of it and the first patch. MAHOUT-148 Convert Classification Algs to use richer Writable syntax Patch Available 10/Oct/09 Ready to submit? MAHOUT-157 Frequent Pattern Mining using Parallel FP-Growth Patch Available 13/Oct/09 Seems like still work in progress. If it's 'good enough', submit and continue iterating. Or mark 0.3 MAHOUT-165 Using better primitives hash for sparse vector for performance gains Open 14/Oct/09 Per discussion, move the remainder (migration to Colt or something) to 0.3 MAHOUT-106 PLSI/EM in pig based on hofmann's ACM 04 paper. Patch Available 27/Aug/09 This looks like something better tagged as 'unknown version'; don't understand the status MAHOUT-114 Release Process Needs to sign published dependencies such as Hadoop, etc. Open 06/Apr/09 Not clear on status here, mark as 0.3? 
On Mon, Oct 12, 2009 at 11:05 AM, Sean Owen wrote: > I am ready too. Same question, what is left that must block 0.2 and what is > the ETA looking like? > > On Oct 12, 2009 6:07 PM, "Robin Anil" wrote: > > Everything looks good from my side. I will work on the launcher and tidying > up Bayes classifier, the next couple of days. Any idea on a target date? If > there is time, I would like to spend those precious amazon credits to > register some performance numbers. > Robin > > On Tue, Oct 6, 2009 at 5:53 PM, Isabel Drost wrote: > On > Tue, 6 Oct 2009 17:36...