These are good questions to ask.  I don't know that we are ready to answer
them, but I do think that we have pieces of the answers.

So far, there are three or four general themes that seem to be of real
interest/value:

a) taste/collaborative filtering/cooccurrence analysis

b) facilitation of conventional machine learning by large-scale aggregation
using Hadoop (so far, this is largely cooccurrence counting)

c) standard, basic machine learning tasks such as clustering and simple
classifiers running on large-scale data

d) stuff
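
To make (a)/(b) concrete, here is a toy single-machine sketch of the
cooccurrence counting and the taste scorer built on top of it. The function
names and the grocery data are mine, not anything in Mahout; at scale the
same pair counting would be expressed as a Hadoop map-reduce job, but the
logic is identical:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(user_histories):
    """Count how often each ordered pair of distinct items occurs
    together in some user's history (the symmetric cooccurrence
    matrix, stored sparsely as a dict)."""
    counts = defaultdict(int)
    for items in user_histories:
        # every unordered pair of distinct items in one user's history
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(counts, seen):
    """Rank unseen items by total cooccurrence with the items a user
    has already touched -- the simplest taste/CF scorer over (a)."""
    scores = defaultdict(int)
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    return sorted(scores, key=scores.get, reverse=True)

histories = [["milk", "bread", "eggs"], ["milk", "bread"], ["bread", "eggs"]]
counts = cooccurrence_counts(histories)
print(recommend(counts, seen={"milk"}))  # bread cooccurs twice, eggs once
```

The map-reduce version just shards the per-user pair emission across mappers
and sums counts in reducers; nothing about the counting itself changes.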

There is definitely pull for something like (a), in the form of a CF
library roughly equivalent to Lucene.  I know that I have a need for (b) and
occasionally (c).

It seems reasonable that we can provide a coherent story for (a), (b) and
(c).  If that is true, then (d) can go along for the ride.

The fact is, however, that 99% of the machine learning I do is quite doable
in a conventional system like R, although some of that 99% needs (b).  Very
occasionally I need algorithms to run at large scale, but those systems
always involve quite a bit of engineering to connect the data fire-hoses
to the right spigots.  I don't think that my experience is all that unusual,
either.

Do other people share Sean's sense of urgency?

Is my break-down a reasonable one?

On Fri, Sep 4, 2009 at 9:13 AM, Sean Owen <sro...@gmail.com> wrote:

> It may be presumptuous but I volunteer to try to lead answers to these
> questions. It's going to lead to some tough answers and more work in
> some cases, no matter who drives it. Hoping to do it sooner than
> later.
>



-- 
Ted Dunning, CTO
DeepDyve
