Hi,

I've been doing some reading through the archives to search for some inspiration with a problem I've been attempting to solve at work, and was hoping I could share where my head's at and get some pointers for where to go. Since we're looking at clustering between 17m and 70m documents, we're looking to implement this in Hadoop.

We're trying to build clusters of (relatively small) documents, based on the words/terms they contain. Clusters will probably range in size between 10 and 1000 documents. Clusters should ultimately be populated by documents that contain almost identical terms (nothing clever like stemming done so far).

So far, I've been working down the pairwise similarity route. So, using a few MapReduce jobs we produce something along the lines of the following:

Da: [Db,0.1] [Dc,0.4]
Db: [Dc,0.5] [Df,0.9]
...
Dj:

With a row per-document, containing a vector of tuples for related documents, and a similarity score. Viewed another way, it's the typical matrix:

               A        B       C
-----+------+------+-----+
A    |          |   0.1 |   0.4
B    |          |          |   0.5

etc. A higher number means a more closely related document.

I've been trying to get my head around how to cluster these in a set of MapReduce jobs and I'm not quite sure of how to proceed: the examples I've read around kmeans, canopy clustering etc. all seem to work on multidimensional (numerical) data. Given the data above, is it even possible to adapt the algorithm? The potential centroids in the example above would be the documents, and I just can't get my head around applying the algorithms to this kind of data.

I guess the alternative would be to step back to producing a matrix of terms x documents:

            Ta     Tb     Tc
-----+------+------+-----+
Da  |          |   0.1 |   0.4
Db  |          |          |   0.5

And then cluster based on this? This seems similar in structure to the user x movie recommendation matrix that's often used?

I'd really appreciate people's thoughts- am I thinking the right way? Is there a better way?

Thanks,
Paul

Reply via email to