2) Model formats

Proposal: a few common structures, plus higher-level conventions about how to compose them.

For matrix data, the R "dataframe" is a time-tested format for dense vectors, matrices and tensors. Something similar that also handles most sparsity cases would let us drop a lot of hard-coded formats.

We would also need a counterpart format for discrete data structures such as graphs, FPGrowth, etc. If none exists in the public sphere, here is one: an object with two lists, each with a label. This can represent one node or edge of a graph. To read in the graph, you would fill hashtables from the labels. Add a double and you have a weighted graph. Call it a "bundle".

FPGrowth uses a more complex data structure. This provides two use cases:

1) A hard use case: composing its data with a simpler object, because you have to save the simple objects together with metadata that lets you read them back and reconstitute the full structure.

2) A simpler use case: saving "flattened" variations of the full data structure as a stream of bundles.

On Sat, Oct 29, 2011 at 8:45 PM, Isabel Drost <isa...@apache.org> wrote:

>
> Mahout seems to be at a stage where we have covered most of the
> interesting machine learning problems, where it is being used in
> production by quite some developers - hey, we even got a book that is
> now available in a printed version.
>
> Maybe it's time to start taking first steps towards a 1.0 release. One*
> important step in my opinion is to define what kind of backwards
> compatibility guarantees we want to give our users - and what
> guarantees our users really need - after releasing 1.0.
>
> Just a rough list below - feel free to extend, shrink and change:
>
> 1) Data input formats - people probably do not want to re-generate
> vectors from their original data every time they use a new Mahout
> version.
>
> 2) Model formats - people probably do not want to have to retrain a
> model only to make it work with the latest and greatest features of a
> new Mahout release.
>
> 3) Model output - when upgrading users probably want to receive model
> output that is then integrated in their system the same way as with the
> older release.
>
> 4) APIs - I don't see us keeping all interfaces or even abstract
> classes stable. However users should know which APIs we consider
> "public facing" and will likely keep stable. Maybe an annotation makes
> that clear?
>
> 5) Command line scripts - is there a significant user base relying on
> the bin/mahout script to warrant working towards keeping that stable
> between releases?
>
> Most likely I've forgotten about other vital pieces - just wanted to
> kick off that discussion.
>
>
> Isabel
>
>
> * though not the only one - others include but are not limited to the
> time frame for which we offer support for any given release.
>

--
Lance Norskog
goks...@gmail.com
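
A rough sketch of how the "bundle" idea from the reply could look, in Python for brevity. Everything here is an assumption for illustration and none of it is existing Mahout code: the names LabeledList, Bundle, edge and read_graph are hypothetical, and so is the reading that an edge is a bundle whose two lists are labeled "source" and "target". The optional weight is the added double, and reading the graph fills a hashtable keyed by node name.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class LabeledList:
    """A list of node names with a label saying what the list means."""
    label: str
    items: List[str] = field(default_factory=list)


@dataclass
class Bundle:
    """Two labeled lists, plus an optional double for weighted graphs."""
    first: LabeledList
    second: LabeledList
    weight: Optional[float] = None


def edge(src: str, dst: str, weight: Optional[float] = None) -> Bundle:
    """Build one edge as a bundle (assumed labels: "source" and "target")."""
    return Bundle(LabeledList("source", [src]), LabeledList("target", [dst]), weight)


def read_graph(bundles: Iterable[Bundle]) -> Dict[str, Dict[str, Optional[float]]]:
    """Reconstitute an adjacency hashtable from a stream of edge bundles."""
    adjacency: Dict[str, Dict[str, Optional[float]]] = defaultdict(dict)
    for b in bundles:
        if b.first.label != "source" or b.second.label != "target":
            raise ValueError("expected an edge bundle labeled source/target")
        for s in b.first.items:
            for t in b.second.items:
                adjacency[s][t] = b.weight
                # Make sure nodes with no outgoing edges still appear.
                adjacency.setdefault(t, {})
    return dict(adjacency)


# A "flattened" graph saved as a stream of bundles, then read back.
stream = [edge("a", "b", 0.5), edge("b", "c"), edge("a", "c", 2.0)]
graph = read_graph(stream)
```

An unweighted edge simply leaves the weight unset, so one bundle type covers both the weighted and unweighted case. A more complex structure such as the FPGrowth tree would need extra metadata beyond what this sketch stores.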