Hi. I've been visiting some mailing lists trying to find directions on how to address a quite cool problem. The answer so far has been: "Look at K-Means clustering". Since I'm quite familiar with Hadoop, which is incorporated in both our crawler and our webstats engine, it seemed that the right gang to turn to was the Mahout users.
Basically we have 40k sites in our network, for which we track weblog stats like unique browsers, page impressions, sessions etc. We want to learn more about our network, so I've started to develop a solution which would create a similarity matrix so you could say: these 10 sites are most similar to Site X in terms of visiting patterns, i.e. they have the same kind of audience.

We have one big problem though... It will take 5 years to compute this matrix at the current implementation speed :) That's why I'm starting to look elsewhere.

The matrix is really simple, and below is an example:

        site1  site2  site3 ....
  uid1  X  X
  uid2  X  X
  uid3  X
  ....

or table-wise it could be something like:

  CREATE TABLE UniqueSiteVisitorSample(
      s1 bit,
      s2 bit,
      s3 bit,
      ....
      uid bigint
  )

where an X (bit set) means that one visitor visited a specific site, so two sites with many common "X's" are similar...

As I said, it is a _very_ simple data structure, but large, and I don't know of any storage mechanism where you could store 40k+ columns...

Would it be feasible to use Mahout to create some output that states how similar SiteX is to SiteA, SiteB etc.? If it takes some hours (or days) to compute, that's quite OK from my standpoint, since the matrix could be recreated every X weeks or so.

I hope someone here on the list thinks, "Man, this guy is stupid, he should do it like this!" :)

/Marcus

-- 
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
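P.S. To make the "many common X's" idea concrete, here is a rough sketch in Python. It assumes the matrix is stored sparsely as (uid, site) pairs rather than as 40k+ columns, and it assumes Jaccard similarity (shared visitors divided by visitors of either site) as the measure. Both are my assumptions for illustration, and the sample data and function names are made up. It is not what our current implementation does, and it says nothing about how Mahout would do it.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: one (uid, site) pair per "X" in the matrix.
visits = [
    (1, "site1"), (1, "site2"),
    (2, "site1"), (2, "site2"),
    (3, "site3"),
    (4, "site2"), (4, "site3"),
]


def jaccard_similarities(visits):
    """Return {(site_a, site_b): score} for every pair sharing a visitor."""
    sites_by_uid = defaultdict(set)
    visitors_per_site = defaultdict(int)
    for uid, site in set(visits):  # de-duplicate repeated visits
        sites_by_uid[uid].add(site)
        visitors_per_site[site] += 1

    # Count shared visitors only for site pairs that actually co-occur,
    # instead of comparing all 40k x 40k site pairs.
    shared = defaultdict(int)
    for sites in sites_by_uid.values():
        for a, b in combinations(sorted(sites), 2):
            shared[(a, b)] += 1

    # Jaccard = |A and B| / |A or B|
    return {
        (a, b): n / (visitors_per_site[a] + visitors_per_site[b] - n)
        for (a, b), n in shared.items()
    }


def most_similar(site, sims, k=10):
    """Return up to k (other_site, score) tuples, best match first."""
    scored = []
    for (a, b), score in sims.items():
        if a == site:
            scored.append((b, score))
        elif b == site:
            scored.append((a, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


sims = jaccard_similarities(visits)
print(most_similar("site2", sims))
```

The pair counting is done per visitor, so the cost depends on how many sites each visitor touches rather than on the total number of site pairs. That per-visitor loop is also the part that looks like it could be split into a map step (emit site pairs per uid) and a reduce step (sum the counts).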
