That kind of clustering is not what would answer the important question for
me.

It sounds like the important question is one of two things:

a) what drives the viewing of the page (using external evidence only and not
estimating any kind of user model)

b) can we estimate some sort of mental state that leads to viewing (using
evidence to drive a user model which drives new evidence)

The first approach has the virtue of great simplicity.  To predict a view,
you can find anomalously co-occurrent URLs (co-occurrent in the sense that
they were viewed by the same user).  Then you can use the anomalously
co-occurrent URLs to build a viewing model, probably using something like
ridge logistic regression or simply IDF weighting.  I don't think that
naive Bayes is as good for these models as people think.  Anomalous
co-occurrence is not well detected by Jaccard or its cousins, but it is
well detected by log-likelihood ratio tests based on 2x2 contingency
tables.  Simply using anomalously co-occurrent URLs makes for a very good
recommendation engine if you have good data rates.
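As a concrete sketch of the 2x2 test I mean, here is the G^2 log-likelihood
ratio statistic computed from raw counts (the counts in the comments and the
function name are just illustrative):

```python
from math import log

def llr_2x2(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table.

    k11 = users who viewed both URL A and URL B
    k12 = users who viewed A but not B
    k21 = users who viewed B but not A
    k22 = users who viewed neither
    """
    def h(*counts):
        # unnormalized (negative) entropy: sum k * log(k / total)
        total = sum(counts)
        return sum(k * log(k / total) for k in counts if k > 0)

    # G^2 = 2 * (H(cells) - H(row sums) - H(column sums))
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

# an independent table scores ~0; a strongly associated one scores high
print(llr_2x2(10, 10, 10, 10))
print(llr_2x2(100, 1, 1, 10000))
```

Large values flag pairs that co-occur far more (or less) often than their
marginal rates would predict, which is exactly what Jaccard-style measures
miss at skewed counts.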

The second approach has the virtue that you can define a nice metric for
URLs.  First you take the likelihood that two URLs will both be visited by
a particular user, then you average over the distribution of all users.
This gives you an overall co-occurrence probability for the two URLs, the
negative log of which is likely to be informative if interpreted as a
distance.  Building the user model gives you the virtue of smoothing, which
is where Jaccard falls apart.  For low data rates, this is likely to give
you the best results.
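In matrix form this averaging is one multiplication.  A minimal sketch,
assuming you already have per-user visit probabilities from some user model
(random numbers stand in for the model here, and independence of visits
given the user is assumed):

```python
import numpy as np

# p[u, j] = modeled probability that user u visits URL j
# (stand-in for a real user model; random values for illustration only)
rng = np.random.default_rng(42)
p = rng.uniform(0.01, 0.99, size=(1000, 5))

# Average the joint visit probability over the user distribution,
# assuming visits are independent given the user:
#   cooccur[a, b] = E_u[ p_u(a) * p_u(b) ]
cooccur = p.T @ p / p.shape[0]

# negative log of the co-occurrence probability as a distance-like measure
distance = -np.log(cooccur)
```

Because the model supplies nonzero probabilities even for pairs never seen
together, the distance stays finite; that is the smoothing that raw Jaccard
on sparse data does not give you.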

Neither of these approaches involves clustering because clustering is not
the goal.  If you really want the clustering as an end in itself, I think
you should restate the coordinate space before clustering.  A good way to
do that is to build a user model that is a simple mixture; the mixture
coefficients for each document then make good coordinates for normal
clustering algorithms (because dot products have such a natural
interpretation, which implies that Euclidean distance is a useful measure).
In fact, the dot product for the mixture model is exactly the "average over
all users" idea from above for the special case of a mixture model.  Other
user models might not be so simple.  Jeff Eastman has done a lot of work on
building such a mixture model estimator for Mahout.
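The dot-product equivalence is easy to check numerically.  In this sketch
the mixing weights and per-component visit probabilities are made-up
numbers; the point is only that the mixture expectation over user types
equals a plain dot product of per-URL coordinate vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_urls = 3, 6

pi = np.array([0.5, 0.3, 0.2])                 # mixing weights over user types
p = rng.uniform(0.05, 0.6, size=(K, n_urls))   # p[k, j] = P(visit URL j | type k)

# "average over all users" computed directly as the mixture expectation:
#   cooccur[a, b] = sum_k pi_k * p_k(a) * p_k(b)
cooccur = sum(pi[k] * np.outer(p[k], p[k]) for k in range(K))

# the same quantity as a dot product of coordinates v[j, k] = sqrt(pi_k) * p[k, j],
# so Euclidean geometry on these coordinates matches the co-occurrence structure
v = (np.sqrt(pi)[:, None] * p).T               # shape (n_urls, K)
assert np.allclose(cooccur, v @ v.T)
```

That is why the mixture coefficients make sensible coordinates for ordinary
clustering algorithms.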

I have had cases where clustering and model building were essentially unable
to interpret the data at all in the original data space, but were able to
extract enormous amounts of useful information when the data was restated in
terms of mixture model coefficients.


On Tue, Jan 13, 2009 at 4:33 AM, Ankur (JIRA) <[email protected]> wrote:

> I would like to cluster them in a tree and use the model to answer the near
> neighborhood type queries i.e. what urls are related to what other urls. I
> did implement a sequential bottom-up hierarchical clustering algorithm but
> the complexity is too bad for my data-set. I then thought about implementing
> a top-down hierarchical clustering algorithm using Jaccard co-efficient as
> my distance measure and came across this patch.
>
