The debate above seems pretty complete. What are positive actions that will make Mahout healthier?
Suggestions from the debate:
* Automated patch testing. This would cure the 'rotting patch' problem.
* Chivvying contributors for detailed notes.
* ?

Personal concepts:
* Regression suite with real data. There have been cases of "Three Cs" batch jobs slowly or quickly drifting from good outputs. A separate suite which exercises the various algorithms with real data would help catch these.
* Regression suite of the Mahout in Action code. Books really help a project, and their code goes stale. We need some way to keep the MIA examples fresh.

On 10/22/11, Dmitriy Lyubimov <[email protected]> wrote:
> I feel like I am most closely aligned with Grant. Very little to add.
>
> Like it or not, Mahout is a library, not a coherent product such as HBase.
> It's a collection of algorithms connected together with some fairly thin
> structural and persistence glue, but the glue rarely can go much beyond that.
> That naturally presents difficulties with support, as not every committer is
> broadly qualified to advise on any algorithm (as opposed, for example, to
> HBase, which is pretty much a single product and therefore is much easier to
> gain proficiency in).
>
> If we look around at ML projects, e.g. BUGS, Vowpal Wabbit, libsvm, they all
> seem to revolve around a single area of ML. Hence they get support in that
> area. There are a few exceptions, like Weka, but they revolve around "non-big"
> data and therefore use well-known approaches, whereas Mahout almost always
> requires added value to make a method scalable. That added value rarely
> results in a published paper or even decently reviewed working notes, which
> makes support of the thing even more difficult.
>
> Hence, a few thoughts.
> 1. Request and review more or less detailed working notes from the
> contributor before he vanishes from the radar.
>
> 2. Don't get upset by the multiplicity of open JIRAs.
> If a JIRA sits around and is not
> fixed for the upcoming release, just create a special 'backlog' fix target
> and throw it there until the author provides more information.
>
> 3. I suggest reviewing some contributions from a practicality point of view, i.e.,
> if the author had a concrete need for his contribution and was using it
> himself, take a more favourable view of it. It would result in the majority of
> contributions being focused on the most common pragmatic needs, rather than
> being a technology in search of a problem. (That's, btw, how my code evolved:
> I coded it not because I had an itch, but because I needed an MR-based LSA
> solution.) In other words, pragmatically necessary things tend to get a better
> chance of being finished and improved upon naturally. But they still may
> take months and even years to evolve into a nicely optimized solution, so there
> is no need to nix something right away. Just throw it in the backlog, and even
> if the author does not reappear for as much as 18 months, don't nix it; just
> let it sit in backlog limbo. These things often don't come easy (to me, anyway).
>
> 4. Even though we may not fully understand a method, we can still set
> some standard requirements for contributions. I already mentioned
> working notes. But we may also ask contributors to define standard
> characteristics, such as the number of MapReduce iterations required,
> parallelization strategy, and flops. It would be ideal if we could also find a
> way to run and publish standard benchmarks on, say, 10G of input just to see if
> it smells. It would help (me at least) if this data, along with maturity level,
> were published in the wiki. Also, request a method tutorial from the
> contributor, written to the wiki.
> On Oct 22, 2011 10:36 AM, "Benson Margulies" <[email protected]> wrote:
>
>> Drat: I wrote 'is necessarily a badge of shame' when I meant to write
>> 'is not necessarily a badge of shame'.
>>

-- 
Lance Norskog
[email protected]
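P.S. A minimal sketch of the "regression suite with real data" idea, to make it concrete. The class name, tolerance, and the `runAlgorithm` stand-in are all hypothetical (not real Mahout APIs); a real suite would run an actual Mahout job over a fixed real-world dataset and compare against a baseline recorded from a known-good release:

```java
// Hedged sketch of a golden-data regression check: run an algorithm over
// fixed input and fail loudly if its output drifts from a recorded baseline.
// "runAlgorithm" is a placeholder stand-in, NOT a real Mahout API.
public class GoldenRegressionCheck {

    // Placeholder for invoking a real Mahout algorithm on fixed input.
    static double[] runAlgorithm(double[] input) {
        double[] out = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = input[i] * 2.0; // trivial stand-in computation
        }
        return out;
    }

    // Root-mean-square deviation between the current output and the baseline.
    static double rmsDeviation(double[] actual, double[] baseline) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double d = actual[i] - baseline[i];
            sum += d * d;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        double[] input = {1.0, 2.0, 3.0};
        // "Golden" output recorded once from a known-good release.
        double[] baseline = {2.0, 4.0, 6.0};
        double rms = rmsDeviation(runAlgorithm(input), baseline);
        // A drifting "Three Cs" job would trip this tolerance check.
        if (rms > 1e-6) {
            throw new AssertionError("Output drifted from baseline: rms=" + rms);
        }
        System.out.println("regression check passed");
    }
}
```

Run nightly against real data, a check like this catches slow output drift that unit tests on toy data miss.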
