Re: Call to action – Mahout needs your help

Dmitriy Lyubimov Mon, 25 Mar 2013 09:21:03 -0700

On Mar 25, 2013 8:36 AM, "Grant Ingersoll" <gsing...@apache.org> wrote:
>
>
> On Mar 25, 2013, at 4:10 AM, Sebastian Schelter wrote:
>
> > Hi,
> >
> > throwing in my 2 cents here:
> >
> > I think that you made a very good point in stating that it is not
> > clear whether Mahout is a library or a standalone program to interact
> > with via the command line. IMO, it's first and foremost a library
> > (similar to Lucene), and this should also be reflected in the codebase.
>
> That is my view as well and I think we have been moderately successful
> at it.
>
> >
> > I don't agree that we simply lack manpower but have a clear vision. I
> > actually think it's the other way round. I think Mahout is kind of stuck,
> > because it does not have a clear vision. I think we faced and still face
> > very hard challenges, as we have to provide answers for the following
> > questions:
> >
> > * for which problems and algorithms does it really make sense to use
> > MapReduce?
>
> My test is simply whether someone has implemented it or not.  I don't
> think we have to have a line in the sand.


It is in fact very easy to test (IMO). Most of the complaints revolve
around highly iterative methods. It is sufficient to estimate the startup
and inter-step persistence costs for the required number of iterations, and
that gives overhead no. 1. For example, popular PageRank-style stationary
distribution methods fall into this category, as do iterative
bootstrap-style search techniques such as the search for the optimum fit in
regularized ALS.
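
To make that estimate concrete, here is a rough back-of-the-envelope
sketch. The function names and all the numbers are made up for
illustration (they are not Mahout or Hadoop measurements); the only thing
taken from the above is the idea of multiplying the per-step startup and
persistence costs by the number of iterations.

```python
def iteration_overhead(startup_s, persist_s, iterations):
    """Fixed cost (seconds) paid for job startup and for persisting
    state between steps, summed over all iterations."""
    return iterations * (startup_s + persist_s)


def overhead_fraction(startup_s, persist_s, compute_s, iterations):
    """Share of total wall-clock time that is overhead rather than
    useful per-step computation."""
    overhead = iteration_overhead(startup_s, persist_s, iterations)
    total = overhead + iterations * compute_s
    if total == 0:
        return 0.0
    return overhead / total


if __name__ == "__main__":
    # Hypothetical figures: 30 s job startup, 20 s to write and re-read
    # the state between steps, 10 s of useful work per step, 50 steps.
    print(iteration_overhead(30, 20, 50))
    print(overhead_fraction(30, 20, 10, 50))
```

With these invented figures the overhead dominates the useful work, which
is the situation the complaints describe. With few iterations or heavy
per-step computation the same arithmetic comes out in favor of MapReduce.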

A slightly more subtle overhead no. 2, in my experience, stems from the
forced sort required for grouping anything (especially, I think, in things
such as matrix-matrix multiplication) and, perhaps to a much lesser degree,
from what people have mentioned: the lack of a scatter operator.



> A working, tested, demonstrable implementation beats the one that isn't,
> regardless of which approach it uses, so I don't think we have to decide up
> front but instead look at it on a case by case basis.  At the end of the
> day, those who do the work get to decide.
>
> >
> > * how broad can the spectrum of things that we offer be without a
> > decline in quality?
> >
> > * how do we deal with the fact that our codebase is split up into a
> > collection of algorithms with very few people being able to work on all
> > of them, due to the required theoretical background and the complexity
> > of efficient code?
> >
> > * how do we provide solutions that allow users to scale in very
> > fine-grained steps, e.g. from online to precomputed on a single machine
> > to precomputed via Hadoop in the recommender stuff?
>
> I don't see these as vision issues, I see them as implementation issues.
> Regardless, it doesn't matter which category they fall under, as they are
> the important issues we face.
>
> As for the complexity issue, I don't know that we ever solve it, we just
> need to identify contributors in those areas quickly, mentor them, and make
> them committers as soon as they are ready.
>
>
>
> >
> > I think that Mahout is and should always be more than recommenders, but
> > that we should be more courageous in throwing out things that are not
> > used very much or not maintained very much or don't meet the quality
> > standards which we would like to see.
>
> +1.  I think we have gotten a lot better at this, thanks to Sean, you and
> others.
>
> >
> > It is also my personal experience (= I heard it over and over again from
> > our users) that it is extremely hard to get started with Mahout using
> > the available documentation. MiA is the exception to this, but people
> > have to buy it first and it lacks a lot of the latest developments. It
> > would be awesome to have a reworked wiki that is qualitatively
> > comparable to MiA.
> >
>
> Good docs are always hard.  Whatever reduces barriers, the better.  Going
> w/ the Github model, there's a lot to be said for Javadocs and/or Markdown
> right in the code base, but neither solves the developer inertia of
> actually writing them.
>
>
> > Best,
> > Sebastian
> >
> > On 25.03.2013 07:29, Isabel Drost-Fromm wrote:
> >>
> >>
> >> On Monday, March 25, 2013 07:22:46 AM Isabel Drost-Fromm wrote:
> >>> On Sunday, March 24, 2013 05:38:00 PM Grant Ingersoll wrote:
> >>>> On Mar 24, 2013, at 5:03 PM, Isabel Drost-Fromm wrote:
> >>>>> What about an experiment: If you (reading this mail) were to write
> >>>>> a two sentence vision statement for Mahout as you see it - what
> >>>>> would that be?
> >>>>
> >>>> Produce open source, scalable machine learning code using a community
> >>>> development model.
> >>>
> >>> So taking that apart:
> >>>
> >>> - Hadoop is not necessarily part of the equation. All that we promise
> >>> are implementations that are reasonably scalable.
> >>
> >> - We play well with small-ish (fits in memory) and large (fits only in
> >> memory of many machines) or huge (fits only on disk) datasets.
> >>
> >>> - There is no restriction in there wrt. supporting only specific use
> >>> cases - in particular no restriction to be recommendations only.
> >>>
> >>> - There is no restriction to "only batch" or "only online" learning.
> >>>
> >>> If we want to be that broad we definitely lack lots of people, I
> >>> think.
> >>>
> >>> The other question that I cannot answer today: Do we want to be a Java
> >>> Library that people link with their project, a standalone program that
> >>> people interact with via the command line, a basis that people can
> >>> easily integrate into their
> >>> Pig/Hive/Cascalog/Scalding/Cascading/what-ever-else
> >>> workflows or all of these?
> >>
> >>
> >
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>
>
>
>
>
