To add to this, and to save the breath of the participants in the formerly private discussion: there seems to be rough consensus about removing cruft, but there has also been quite a bit of desire to be very sensitive to the needs of current and planned production users. Somewhat less intense is a desire to be sensitive to the needs of less critical uses.

On Tue, May 8, 2012 at 8:11 AM, Robin Anil <[email protected]> wrote:

> Based on some discussion on the private group about where Mahout is
> faltering in the real world, a stream of thought bubbled up - Make Mahout
> leaner, i.e. push the best stuff we have to the top and prune out algorithms
> that are underperforming. The main issue here is that the iterative nature of
> many of the algorithms makes them inefficient to implement on top of
> current Hadoop. The summary of the state of the discussion so far:
>
> 1) Focus on large scale data (not medium scale) and focus on algorithms that
> run at *almost* O(n).
> 2) Focus on deployability and less on making it an analysis tool for data
> competitions.
> 3) Prune prune prune things that are not being maintained.
>
> The following is one way of looking at Mahout and the state of its
> algorithms. Let us know if you would like something to be in the keeper
> category.
>
> Keepers
> 1. Recommenders -- clearly a keeper
> 2. SGD
> 3. LDA
> 4. Some clustering (with upgrades)
> 5. Math + collections
> 6. Hadoop Utilities + Integration -- I know it's silly, but things like
> sequence file dumper, the iterators, etc. are handy in a number of places.
> 7. SVD and related
> 8. RowSimilarity
> 9. Some of the upfront preprocessing tools (Lucene, Text, etc.)
>
> Unsure:
>
> - Bayes + Random Forest - Seems a shame on Bayes, since it gives a
> baseline, but I don't know that it actually works and then there's the
> whole split personality nature of it (text-based and vector-based)
> - Collocations - I'd say keep for now, even if just for selfish reasons
> - Minhash - every time I look at it, it seems broken and the original author
> doesn't respond to requests for explanation.
> - Freq. Item Set - Tom's done some work to clean up and I've tried it on
> search logs and the results looked OK, but no formal evaluation. I've seen
> others say why not just do simpler co-occurrence stuff...
>
> Drop for sure:
>
> 1. Watchmaker
> 2. Unused/poor examples
> 3. Probably a lot more that escapes me at the moment.
> 4. PageRank
> ------
> Robin Anil
>
