Hi,

On Mon, Mar 25, 2013 at 9:10 AM, Sebastian Schelter <s...@apache.org> wrote:

> throwing in my 2 cents here:
>
> IMO, it's first and foremost a library (similar to Lucene), and this should
> also be reflected in the codebase.
>

This would be my view as well. It should be easy for people who speak Java
to take the implementations and plug them into their own projects. For
those dealing with text, it should be trivial to combine Lucene and its
analyzers for data pre-processing and feed the resulting vectors into
Mahout algorithms.
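
For concreteness, here is a minimal sketch of the kind of glue code I mean,
assuming Lucene 4.x and the mahout-math Vector API; the class name, field
name and hashed-index scheme are illustrative only, not an existing Mahout
utility (a real pipeline would rather use a persisted dictionary or the
vectorizer encoders):

  import java.io.IOException;
  import java.io.StringReader;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.util.Version;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class LuceneVectorizerSketch {

    /** Tokenize text with a Lucene analyzer and hash the terms into a sparse Mahout vector. */
    public static Vector vectorize(String text, int cardinality) throws IOException {
      // Version constant depends on the Lucene release in use
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
      Vector vector = new RandomAccessSparseVector(cardinality);
      TokenStream stream = analyzer.tokenStream("body", new StringReader(text));
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        // hashed feature index keeps the sketch dictionary-free; collisions are accepted
        int index = (term.toString().hashCode() & Integer.MAX_VALUE) % cardinality;
        vector.set(index, vector.get(index) + 1.0); // raw term frequency
      }
      stream.end();
      stream.close();
      analyzer.close();
      return vector;
    }
  }
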



>
> I don't agree that we simply lack manpower but have a clear vision. I
> actually think it's the other way round. I think Mahout is kind of stuck,
> because it does not have a clear vision. I think we faced and still face
> very hard challenges, as we have to provide answers for the following
> questions:
>
> * for which problems and algorithms does it really make sense to use
> MapReduce?
>

Being a notorious optimist, I'm confident that we should now be in a good
position to answer that question.



>
> * how broad can the spectrum of things that we offer be without a
> decline in quality?
>
> * how do we deal with the fact that our codebase is split up into a
> collection of algorithms with very few people being able to work on all
> of them, due to the required theoretical background and the complexity
> of efficient code
>

One thing that has always been on my mind is to focus on a handful of core
use cases - defined broadly, in the sense that "classification" counts as a
use case of its own. For each use case there should be a limited number of
algorithm implementations. If being parallel is still on our agenda, then
for each use case we should have at least a single-machine story and a
going-parallel story, with a clear path for users to scale their
application from a single machine to multiple machines without too many
adjustments to their code (if that is at all possible) or to the conceptual
client-side architecture.



>
> * how do we provide solutions that allow users to scale very fine
> grained, e.g. from online to precomputed on a single machine to
> precomputed via Hadoop in the recommender stuff.
>

+1
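
For the recommender case that gradient partly exists already: the same
comma-separated preference data can drive an online, in-memory Taste
recommender on a single machine or serve as input to the Hadoop-based
recommender jobs. A minimal sketch of the in-memory end of that spectrum
(the preference file path and user id are hypothetical):

  import java.io.File;
  import java.util.List;

  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.recommender.Recommender;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class OnlineRecommenderSketch {
    public static void main(String[] args) throws Exception {
      // one "userID,itemID,preference" triple per line; path is hypothetical
      DataModel model = new FileDataModel(new File("prefs.csv"));
      UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
      UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
      Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
      // top 10 recommendations for user 42, computed online from the in-memory model
      List<RecommendedItem> items = recommender.recommend(42L, 10);
      for (RecommendedItem item : items) {
        System.out.println(item.getItemID() + " " + item.getValue());
      }
    }
  }
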


>
> I think that Mahout is and should always be more than recommenders, but
> that we should be more courageous in throwing out things that are not
> used very much or not maintained very much or don't meet the quality
> standards which we would like to see.
>

Do we have an equivalent of the "attach clothes-pegs to your trousers in
January and throw out anything that still has its peg by the end of
December" rule - that is, can we reliably identify, release by release,
what has not been used?



>
> It is also my personal experience (= I heard it over and over again from
> our users) that it is extremely hard to get started with Mahout using
> the available documentation. MiA is the exception to this, but people
> have to buy it first and it lacks a lot of the latest developments. It
> would be awesome to have a reworked wiki that is qualitatively
> comparable to MiA.
>
>
Strange idea: What do people think of moving some core documentation out
of the wiki and into the distribution (both as JavaDoc and as a few
high-level HTML pages)? Advantages: the documentation is available offline
after downloading the artifact, contributions to the documentation become
very visible (which would help active documenters become committers), and
the documentation gets versioned along with the code. (Not sure whether
moving to the Apache CMS could already help here.)


Isabel
