It's more an artifact of history than design. When this project kicked off it was pretty open-ended -- "large scale machine learning". At some early stage we merged in my (previous, independent) project called Taste, which was all collaborative filtering and not Hadoop-based. So that's where this comes from.
And it's also become clear that the new development has centered around Hadoop and the other abstractions you've seen, so that's why you see a lot more of that. And this begat the distributed counterparts for collaborative filtering. The non-distributed and distributed worlds are fairly different, and I think they will always be somewhat separate, even if they live under the same roof.

There are no non-distributed counterparts for clustering and classification. It's not symmetric, and it would be better if it were, but it's perhaps no big sin. The benefits have outweighed the negatives, I think. Of course nobody would mind developing a scalable non-distributed clusterer / classifier -- but in all things, it's a question of whether anyone cares enough to write and maintain it. (The book purposely led off with recommenders since you can talk about the simpler non-distributed setup first, then segue into distributed, then into clustering / classification.)

I don't think it's perfect, but I think it's OK. I could sure name some other things I'd fix before this!

On Tue, Aug 16, 2011 at 7:04 PM, Jeff Hansen <dsche...@gmail.com> wrote:
> When I first started reading the Manning book, I was a little surprised by
> the description of data structures for preferences in the collaborative
> filtering section. Before getting the book I had really only played around
> with the Vector implementations, and I was used to the Vectors being
> generic lists of <int, double> pairs. So I was a little bit surprised to
> read the description of all the collaborative filtering implementations
> using generic lists of <long, float> pairs.
>
> I was wondering if I could get some general comments on the reason for
> this disparity.
> I'm guessing it's a matter of history and optimization -- Taste was
> optimized for storing more info at the index level and less at the
> "rating" level, whereas Vectors were intended to be generic, with the
> ability to maintain the maximum amount of precision. Unfortunately the
> lowest common denominator is int/float, so if you want to go between
> models you have to fit within the smaller footprint constraint of each
> without getting the benefit of either.
>
> It ends up feeling like there are two faces to Mahout which are somewhat
> incompatible. Are there any thoughts about bridging the gap between the
> two models in the future? If this really is a matter of each model being
> optimized for its problem space, maybe it would just help to have a clear
> delineation of which utilities belong on which side of the fence -- as
> well as some utility for shifting generic types between the models (with
> the warning that there might be loss of precision or of the ability to
> maintain as many ids). That way utilities that already exist on one side
> could be reused on the other side.
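For anyone following along, the "lowest common denominator" problem the question describes can be shown with plain Java primitives alone -- no Mahout types needed. This is an illustrative sketch, not Mahout or Taste API: it just demonstrates what is lost in each direction when a long ID is squeezed into an int index and a double value into a float preference.

```java
// Illustrative only: the two narrowing conversions implied by moving data
// between a <long, float> preference model and an <int, double> vector model.
public class PrecisionLoss {
    public static void main(String[] args) {
        // Preference-style IDs are long; vector indices are int.
        long itemId = 5_000_000_000L;      // larger than Integer.MAX_VALUE
        int vectorIndex = (int) itemId;    // narrowing cast silently overflows
        System.out.println(vectorIndex == itemId);   // prints false

        // Vector values are double; preference values are float.
        double rating = 0.1234567890123;
        float pref = (float) rating;       // only ~7 significant digits survive
        System.out.println((double) pref == rating); // prints false
    }
}
```

So a round trip in either direction can be lossy, which is why any conversion utility would need the precision/ID-range warning the question suggests.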