Hi,
I've been reading through the archives looking for inspiration on a
problem I've been trying to solve at work, and was hoping I could
share where my head's at and get some pointers on where to go next.
Since we're looking at clustering somewhere between 17 and 70 million
documents, we plan to implement this in Hadoop.
We're trying to build clusters of (relatively small) documents, based
on the words/terms they contain. Clusters will probably range in size
between 10 and 1000 documents. Clusters should ultimately contain
documents with almost identical terms (we haven't done anything clever
like stemming so far).
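To make "almost identical terms" a bit more concrete: I've been
picturing the score as something like Jaccard overlap of the raw term
sets, though the exact measure isn't settled. A toy, single-machine
sketch of that idea, purely for illustration:

import java.util.HashSet;
import java.util.Set;

public class TermOverlap {
    // One possible measure (Jaccard): |intersection| / |union| of the raw
    // term sets, so 1.0 means the two documents have identical term sets.
    // Purely illustrative; we haven't committed to this exact measure.
    public static double similarity(Set<String> termsA, Set<String> termsB) {
        Set<String> intersection = new HashSet<>(termsA);
        intersection.retainAll(termsB);
        Set<String> union = new HashSet<>(termsA);
        union.addAll(termsB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}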
So far, I've been working down the pairwise similarity route: using a
few MapReduce jobs, we produce something along the lines of the
following:
Da: [Db,0.1] [Dc,0.4]
Db: [Dc,0.5] [Df,0.9]
...
Dj:
Each row corresponds to a document and contains a vector of (related
document, similarity score) tuples. Viewed another way, it's the
typical similarity matrix:
     |  A   |  B   |  C
-----+------+------+------
  A  |      |  0.1 |  0.4
  B  |      |      |  0.5
etc. A higher number means a more closely related document.
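For reference, the reduce step of the final job is roughly like the
sketch below (heavily simplified; the real key/value types differ and
all the names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Collapses all (docId, "otherDocId,score") pairs for one document into a
// single row like "Da   [Db,0.1] [Dc,0.4]". Simplified; names are illustrative.
public class SimilarityRowReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text docId, Iterable<Text> neighbours, Context context)
            throws IOException, InterruptedException {
        StringBuilder row = new StringBuilder();
        for (Text neighbour : neighbours) {
            // each value is already "otherDocId,score" from the map side
            row.append('[').append(neighbour.toString()).append("] ");
        }
        context.write(docId, new Text(row.toString().trim()));
    }
}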
I've been trying to get my head around how to cluster these in a set
of MapReduce jobs, and I'm not quite sure how to proceed: the examples
I've read around k-means, canopy clustering, etc. all seem to work on
multidimensional (numerical) data. Given the data above, is it even
possible to adapt those algorithms? The potential centroids in the
example above would be the documents themselves, and I can't quite see
how to apply the algorithms to this kind of data.
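The closest thing I can picture is a k-medoids-style assignment rather
than k-means proper: pick k documents as "medoids", then assign every
other document to the medoid it is most similar to, using the rows
above. A sketch of just that assignment step (plain Java, names
illustrative; in practice it would presumably be a map-side step with
the medoid set loaded into memory):

import java.util.Map;
import java.util.Set;

public class MedoidAssignment {
    // k-medoids analogue of the k-means assignment step: given one document's
    // similarity row (otherDocId -> score) and the current set of medoid
    // document ids, return the medoid this document is most similar to,
    // or null if it has no similarity to any of them. Names are illustrative.
    public static String assign(Map<String, Double> similarityRow, Set<String> medoids) {
        String best = null;
        double bestScore = 0.0;
        for (String medoid : medoids) {
            Double score = similarityRow.get(medoid);
            if (score != null && score > bestScore) {
                bestScore = score;
                best = medoid;
            }
        }
        return best;
    }
}

The update step is where I get stuck: there's no obvious "mean" of a
set of documents, and with medoids it would mean picking the member
most similar to the rest of its cluster, which sounds expensive at
this scale.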
I guess the alternative would be to step back and produce a
documents x terms matrix:
     |  Ta  |  Tb  |  Tc
-----+------+------+------
 Da  |      |  0.1 |  0.4
 Db  |      |      |  0.5
And then cluster based on that? This seems similar in structure to the
user x movie matrix that's often used in recommendation examples?
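In that case the per-document map step would just be tokenising each
document into a sparse term-count vector, something like the sketch
below (names are illustrative, and the output here is a plain text
serialisation, whereas presumably it would really need to be whatever
vector format the clustering jobs expect):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Turns one input line ("docId<TAB>document text") into a sparse term-count
// vector serialised as "term:count term:count ...". Illustrative only; the
// point is just that each document becomes a point in term space, which is
// the shape of input the k-means/canopy examples seem to expect.
public class TermVectorMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) {
            return; // skip malformed lines
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String term : parts[1].toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        StringBuilder vector = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            vector.append(entry.getKey()).append(':').append(entry.getValue()).append(' ');
        }
        context.write(new Text(parts[0]), new Text(vector.toString().trim()));
    }
}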
I'd really appreciate people's thoughts: am I thinking about this the
right way? Is there a better way?
Thanks,
Paul