I agree with your high-level breakdown; that's good. I wouldn't say I'm in a
hurry yet, just wanting to ask the questions and start to form a plan.
I don't mind if it only comes together into something polished in a year --
I would be worried if a year passes and it's still not clear where it's going.

I appreciate this is necessarily volunteer work from people's busy
schedules, and we can't afford to invest a ton of mental energy as if
this were our full-time startup or something. That said, if it's worth
smart people dumping time into, we might as well make sure it's on
track to be a world-class reference library.


I'm interested in carrying on the conversation you've continued here --


- Is it fair to say that Mahout is basically machine learning stuff,
focused on large scale? Particularly, focused on Hadoop? That would be
pretty coherent. Then we probably need some consistent approach to
Hadoop integration, some very step-by-step documentation, and we may want
to start reorganizing the code a little to align with this goal.

- So are we specifically not focused on anything that isn't a distributed,
Hadoop-based job?

- The audience is definitely developers, it seems. Nobody's trying to
put a GUI on this.

- I don't really mind "stuff" in the library, though I do want a clear
sense of what it is, and isn't, to guide us on what to keep and what to
remove.

- Would we like to make it a goal for 0.3 to present some unified,
designed approach to Hadoop integration?

- Maybe for, say, 0.4 we have a big 'cookbook' with lots of ready-made
examples? (See the sketch below for the flavor I have in mind.)


... these are the sorts of things, nothing big, that I'd love to talk about
sooner rather than later. That then guides future work, lets us check our
progress against it, and gives us a clear identity to put out there.
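
To make the 'cookbook' idea concrete, here is a minimal sketch of the kind
of ready-made example I have in mind, using the Taste collaborative-filtering
API (theme (a) below). Treat it as illustrative only: the "prefs.csv" input
file and the exact class names and signatures are my assumptions about how
such a recipe might look, not a finished recipe, and the Hadoop-based jobs
would need the same step-by-step treatment.

  // Minimal sketch only: a user-based recommender over a "prefs.csv" file
  // of userID,itemID,rating lines. Class names are from the Taste packages;
  // exact signatures may differ across releases.
  import java.io.File;
  import java.util.List;

  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.recommender.Recommender;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class CookbookRecommenderExample {
    public static void main(String[] args) throws Exception {
      // Load user,item,rating preferences from a local CSV file
      DataModel model = new FileDataModel(new File("prefs.csv"));
      // Pearson correlation between users' rating vectors
      UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
      // Consider each user's 10 nearest neighbors
      UserNeighborhood neighborhood =
          new NearestNUserNeighborhood(10, similarity, model);
      Recommender recommender =
          new GenericUserBasedRecommender(model, neighborhood, similarity);
      // Top 5 recommendations for user 1
      List<RecommendedItem> items = recommender.recommend(1, 5);
      for (RecommendedItem item : items) {
        System.out.println(item);
      }
    }
  }

Something of that size, sitting next to an equally short walkthrough for
submitting the equivalent Hadoop job, is roughly what I'd want on the front
page of a cookbook.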


On Fri, Sep 4, 2009 at 7:07 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> These are good questions to ask.  I don't know that we are ready to answer
> them, but I do think that we have pieces of the answers.
>
> So far, there are three or four general themes that seem to be of real
> interest/value:
>
> a) Taste/collaborative filtering/cooccurrence analysis
>
> b) facilitation of conventional machine learning by large-scale aggregation
> using Hadoop (so far, this is largely cooccurrence counting)
>
> c) standard and basic machine learning tasks like clustering and simple
> classifiers running on large-scale data
>
> d) stuff
>
> There is definitely pull for something like (a), in the form of a CF
> library roughly equivalent to Lucene.  I know that I have a need for (b) and
> occasionally (c).
>
> It seems reasonable that we can provide a coherent story for (a), (b) and
> (c).  If that is true, then (d) can go along for the ride.
>
> The fact is, however, 99% of the machine learning that I do is quite doable
> in a conventional system like R, although some of that 99% needs (b).  Very
> occasionally I need algorithms to run at large scale, but those systems
> always involve quite a bit of engineering to connect the data fire-hoses
> into the right spigots.  I don't think that my experience is all that unusual,
> either.
>
> Do other people share Sean's sense of urgency?
>
> Is my break-down a reasonable one?
>
> On Fri, Sep 4, 2009 at 9:13 AM, Sean Owen <sro...@gmail.com> wrote:
>
>> It may be presumptuous but I volunteer to try to lead answers to these
>> questions. It's going to lead to some tough answers and more work in
>> some cases, no matter who drives it. Hoping to do it sooner than
>> later.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
