2) Model formats
Proposal: a few common structures, with higher-level conventions about how to
compose them.

For matrix data, the R "dataframe" is a time-tested format for dense
vectors, matrices and tensors. Something similar that also handles most
sparsity cases would let us drop a lot of hard-coded formats.
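
To make the idea a bit more concrete, here is a rough Java sketch of a
frame whose named columns can be stored densely or sparsely behind one
interface. All of the names (Column, DenseColumn, SparseColumn, Frame) are
made up for illustration; none of them are existing Mahout or R classes, and
a real format would also need types, serialization and versioning.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical column: callers do not see whether storage is dense or sparse. */
interface Column {
    int size();
    double get(int row);
}

/** Dense storage: one double per row. */
final class DenseColumn implements Column {
    private final double[] values;

    DenseColumn(double... values) { this.values = values.clone(); }

    public int size() { return values.length; }
    public double get(int row) { return values[row]; }
}

/** Sparse storage: only non-zero rows are kept; every other row reads as 0.0. */
final class SparseColumn implements Column {
    private final int size;
    private final Map<Integer, Double> nonZero = new LinkedHashMap<>();

    SparseColumn(int size) { this.size = size; }

    SparseColumn set(int row, double value) {
        if (row < 0 || row >= size) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        nonZero.put(row, value);
        return this;
    }

    public int size() { return size; }
    public double get(int row) { return nonZero.getOrDefault(row, 0.0); }
}

/** Hypothetical frame: an ordered set of named columns. */
final class Frame {
    private final Map<String, Column> columns = new LinkedHashMap<>();

    Frame add(String name, Column column) {
        columns.put(name, column);
        return this;
    }

    double get(String name, int row) { return columns.get(name).get(row); }
}
```

The point of the single Column interface is that code reading a frame would
not need a separate format for each sparsity case.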

We would need a counterpart format for discrete data structures like graphs,
FPGrowth trees, etc. If there is none in the public sphere, here is one: an
object with two lists, each with a label. One such object can represent one
node or edge of a graph. To read in the graph you would fill hash tables
keyed by the labels. Add a double and you have a weighted graph. Call it a
"bundle".
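
A minimal Java sketch of what a bundle might look like, and of reading a
weighted graph back by filling hash tables. This is only one reading of the
idea: the class names (Bundle, BundleGraphReader), the "from"/"to" labels
and the choice of strings as node ids are all assumptions made for the
example, not an existing format.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical "bundle": two labeled lists plus a double used as a weight. */
final class Bundle {
    final String firstLabel;
    final List<String> first;
    final String secondLabel;
    final List<String> second;
    final double weight;

    Bundle(String firstLabel, List<String> first,
           String secondLabel, List<String> second, double weight) {
        this.firstLabel = firstLabel;
        this.first = first;
        this.secondLabel = secondLabel;
        this.second = second;
        this.weight = weight;
    }

    /** Assumed convention: an edge is a bundle with one id in each list. */
    static Bundle edge(String from, String to, double weight) {
        return new Bundle("from", Collections.singletonList(from),
                          "to", Collections.singletonList(to), weight);
    }
}

final class BundleGraphReader {
    /** Fills a hash table of the form source -> (target -> weight). */
    static Map<String, Map<String, Double>> read(Iterable<Bundle> bundles) {
        Map<String, Map<String, Double>> adjacency = new HashMap<>();
        for (Bundle b : bundles) {
            for (String source : b.first) {
                Map<String, Double> out =
                    adjacency.computeIfAbsent(source, k -> new HashMap<>());
                for (String target : b.second) {
                    out.put(target, b.weight);
                }
            }
        }
        return adjacency;
    }
}
```

Because each list can hold more than one id, the same object could also
describe a node together with all of its neighbours, at the cost of sharing
one weight across them.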

FPGrowth uses a more complex data structure. It provides two use cases:
1) a hard use case: composing its data with a simpler object, because you
have to save the simple objects with metadata that lets you read them back
and reconstitute the full structure.
2) a simpler use case: saving "flattened" variations of the full data
structure as a stream of bundles.
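
As a sketch of the second use case, the Java below flattens a small prefix
tree into a list of bundle-like records, one per tree node. The prefix tree
is only a stand-in for the real FP-tree (it has no header table or node
links), and PrefixNode, FlatBundle, TreeFlattener and the "path"/"children"
labels are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical prefix-tree node, a simplified stand-in for an FP-tree node. */
final class PrefixNode {
    final String item;
    long count;
    final Map<String, PrefixNode> children = new LinkedHashMap<>();

    PrefixNode(String item) { this.item = item; }

    /** Inserts one transaction below this node, bumping counts along the path. */
    void insert(List<String> transaction) {
        PrefixNode node = this;
        for (String it : transaction) {
            node = node.children.computeIfAbsent(it, PrefixNode::new);
            node.count++;
        }
    }
}

/** Hypothetical flattened record: two labeled lists plus a double. */
final class FlatBundle {
    final String firstLabel = "path";       // items from the root down to this node
    final List<String> path;
    final String secondLabel = "children";  // items of the node's direct children
    final List<String> children;
    final double count;

    FlatBundle(List<String> path, List<String> children, double count) {
        this.path = path;
        this.children = children;
        this.count = count;
    }
}

final class TreeFlattener {
    /** Emits one record per node below the root, in depth-first order. */
    static List<FlatBundle> flatten(PrefixNode root) {
        List<FlatBundle> out = new ArrayList<>();
        walk(root, new ArrayList<>(), out);
        return out;
    }

    private static void walk(PrefixNode node, List<String> prefix, List<FlatBundle> out) {
        for (PrefixNode child : node.children.values()) {
            prefix.add(child.item);
            out.add(new FlatBundle(new ArrayList<>(prefix),
                                   new ArrayList<>(child.children.keySet()),
                                   child.count));
            walk(child, prefix, out);
            prefix.remove(prefix.size() - 1);  // backtrack before the next sibling
        }
    }
}
```

Each record carries its full path, so a reader could rebuild the tree from
the stream alone; that is roughly the metadata the harder first use case
would also have to save.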

On Sat, Oct 29, 2011 at 8:45 PM, Isabel Drost <isa...@apache.org> wrote:

>
> Mahout seems to be at a stage where we have covered most of the interesting
> machine learning problems, where it is being used in production by quite
> some developers - hey, we even got a book that is now available in a
> printed version.
>
> Maybe it's time to start taking first steps towards a 1.0 release. One*
> important step in my opinion is to define what kind of backwards
> compatibility guarantees we want to give our users - and what guarantees
> our users really need - after releasing 1.0.
>
> Just a rough list below - feel free to extend, shrink and change:
>
> 1) Data input formats - people probably do not want to re-generate vectors
> from their original data every time they use a new Mahout version.
>
> 2) Model formats - people probably do not want to have to retrain a model
> only to make it work with the latest and greatest features of a new Mahout
> release.
>
> 3) Model output - when upgrading users probably want to receive model
> output that is then integrated in their system the same way as with the
> older release.
>
> 4) APIs - I don't see us keeping all interfaces or even abstract classes
> stable. However users should know which APIs we consider "public facing"
> and will likely keep stable. Maybe an annotation makes that clear?
>
> 5) Command line scripts - is there a significant user base relying on the
> bin/mahout script to warrant working towards keeping that stable between
> releases?
>
> Most likely I've forgotten about other vital pieces - just wanted to kick
> off that discussion.
>
>
> Isabel
>
>
> * though not the only one - others include but are not limited to the time
> frame for which we offer support for any given release.
>



-- 
Lance Norskog
goks...@gmail.com
