It's more an artifact of history than design. When this project kicked off
it was pretty open-ended -- "large scale machine learning". At some early
stage we merged in my (previous, independent) project called Taste, which
was all collaborative filtering and not Hadoop-based. So that's where this
comes from.

And it's also become clear that the new development has centered around
Hadoop and the other abstractions you've seen. So that's why you see a lot
more of that. And this begat the distributed counterparts for collaborative
filtering.

The non-distributed and distributed worlds are fairly different, and I think
they will always be somewhat separate, even if they live under the same
roof.

There are no non-distributed counterparts for clustering and classification.
It's not symmetric, and it would be better if it were, but it's perhaps no
big sin. The benefits have outweighed the negatives, I think. Of course
nobody would mind developing a scalable non-distributed clusterer /
classifier -- but in all things, it's a question of whether anyone cares
enough to write it and maintain it.


(The book purposely led off with recommenders since you can talk about the
simpler non-distributed setup first, then segue into distributed, then
into clustering / classification.)

I don't think it's perfect but I think it's OK. I could sure name some other
things I'd fix before this!



On Tue, Aug 16, 2011 at 7:04 PM, Jeff Hansen <dsche...@gmail.com> wrote:

> When I first started reading the Manning book, I was a little surprised by
> the description of data structures for preferences in the collaborative
> filtering section.  Before getting the book I had really only played around
> with the Vector implementations and I was used to the Vectors being generic
> lists of <int, double> pairs.  So I was a little bit surprised to read the
> description of all the collaborative filtering implementations using generic
> lists of <long, float> pairs.
>
> I was wondering if I could get some general comments on the reason for this
> disparity.  I'm guessing it's a matter of history and optimization -- Taste
> was optimized for storing more info at the index level and less at the
> "rating" level whereas Vectors were intended to be generic with the ability
> to maintain the maximum amount of precision.  Unfortunately the lowest
> common denominator is int/float, so if you want to go between models you
> have to fit within the tighter constraint of each model without getting
> the memory benefit of either...
>
> It ends up feeling like there are two faces to Mahout which are somewhat
> incompatible.  Are there any thoughts about bridging the gap between the
> two models in the future?  If this really is a matter of each model being
> optimized for its problem space, maybe it would just help to have a clear
> delineation of which utilities belong on which side of the fence -- as well
> as some utility for shifting generic types between the models (with the
> warning that there might be loss of precision or the ability to maintain as
> many ids).  That way utilities that already exist on the one side could be
> reused on the other side.
>
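For concreteness, here is a minimal plain-Java sketch of the footprint mismatch the question describes. The `VectorEntry` and `Preference` types below are hypothetical stand-ins, not actual Mahout classes: they just model the <int, double> vs. <long, float> pairing so you can see where round-tripping loses information in each direction.

```java
public class PairModels {

    // Vector side: int index, double value (the <int, double> pairing).
    public record VectorEntry(int index, double value) {}

    // Taste side: long id, float rating (the <long, float> pairing).
    public record Preference(long itemId, float rating) {}

    // Preference -> VectorEntry: the value widens float -> double losslessly,
    // but the id must narrow long -> int, which fails for large ids.
    public static VectorEntry toVectorEntry(Preference p) {
        if (p.itemId() > Integer.MAX_VALUE || p.itemId() < Integer.MIN_VALUE) {
            throw new ArithmeticException("item id " + p.itemId() + " does not fit in an int index");
        }
        return new VectorEntry((int) p.itemId(), p.rating());
    }

    // VectorEntry -> Preference: the index widens int -> long losslessly,
    // but the value narrows double -> float, which can lose precision.
    public static Preference toPreference(VectorEntry e) {
        return new Preference(e.index(), (float) e.value());
    }

    public static void main(String[] args) {
        // An id beyond int range can't cross into the vector model at all.
        try {
            toVectorEntry(new Preference(3_000_000_000L, 4.5f));
        } catch (ArithmeticException ex) {
            System.out.println("lost id: " + ex.getMessage());
        }
        // A high-precision value loses digits crossing into the Taste model.
        Preference q = toPreference(new VectorEntry(42, 0.123456789012345));
        System.out.println("narrowed rating: " + q.rating());
    }
}
```

So "fitting the tighter constraint of each" means: ids must fit in an int going one way, and values must fit in a float going the other, while each side still pays its own storage cost.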
