Hi,
I've been reading through the archives looking for inspiration on a
problem I've been trying to solve at work, and was hoping I could
share where my head's at and get some pointers on where to go next.
Since we're looking at clustering somewhere between 17 and 70 million
documents, we plan to implement this in Hadoop.
We're trying to build clusters of (relatively small) documents, based
on the words/terms they contain. Clusters will probably range in size
between 10 and 1000 documents. Clusters should ultimately contain
documents with almost identical terms (we haven't done anything clever
like stemming so far).
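To make "almost identical terms" a bit more concrete: I've been
picturing the score as something like Jaccard overlap of the raw term
sets, though the exact measure isn't settled. A toy, single-machine
sketch of that idea, purely for illustration:

import java.util.HashSet;
import java.util.Set;

public class TermOverlap {
    // One possible measure (Jaccard): |intersection| / |union| of the raw
    // term sets, so 1.0 means the two documents have identical term sets.
    // Purely illustrative; we haven't committed to this exact measure.
    public static double similarity(Set<String> termsA, Set<String> termsB) {
        Set<String> intersection = new HashSet<>(termsA);
        intersection.retainAll(termsB);
        Set<String> union = new HashSet<>(termsA);
        union.addAll(termsB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}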
So far, I've been working down the pairwise similarity route: using a
few MapReduce jobs, we produce something along the lines of the
following:
Da: [Db,0.1] [Dc,0.4]
Db: [Dc,0.5] [Df,0.9]
...
Dj:
Each row corresponds to a document and contains a vector of (related
document, similarity score) tuples. Viewed another way, it's the
typical similarity matrix:
     |  A   |  B   |  C
-----+------+------+------
  A  |      |  0.1 |  0.4
  B  |      |      |  0.5
etc. A higher number means a more closely related document.
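For reference, the reduce step of the final job is roughly like the
sketch below (heavily simplified; the real key/value types differ and
all the names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Collapses all (docId, "otherDocId,score") pairs for one document into a
// single row like "Da   [Db,0.1] [Dc,0.4]". Simplified; names are illustrative.
public class SimilarityRowReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text docId, Iterable<Text> neighbours, Context context)
            throws IOException, InterruptedException {
        StringBuilder row = new StringBuilder();
        for (Text neighbour : neighbours) {
            // each value is already "otherDocId,score" from the map side
            row.append('[').append(neighbour.toString()).append("] ");
        }
        context.write(docId, new Text(row.toString().trim()));
    }
}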
I've been trying to get my head around how to cluster these in a set
of MapReduce jobs, and I'm not quite sure how to proceed: the examples
I've read around k-means, canopy clustering, etc. all seem to work on
multidimensional (numerical) data. Given the data above, is it even
possible to adapt those algorithms? The potential centroids in the
example above would be the documents themselves, and I can't quite see
how to apply the algorithms to this kind of data.
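The closest thing I can picture is a k-medoids-style assignment rather
than k-means proper: pick k documents as "medoids", then assign every
other document to the medoid it is most similar to, using the rows
above. A sketch of just that assignment step (plain Java, names
illustrative; in practice it would presumably be a map-side step with
the medoid set loaded into memory):

import java.util.Map;
import java.util.Set;

public class MedoidAssignment {
    // k-medoids analogue of the k-means assignment step: given one document's
    // similarity row (otherDocId -> score) and the current set of medoid
    // document ids, return the medoid this document is most similar to,
    // or null if it has no similarity to any of them. Names are illustrative.
    public static String assign(Map<String, Double> similarityRow, Set<String> medoids) {
        String best = null;
        double bestScore = 0.0;
        for (String medoid : medoids) {
            Double score = similarityRow.get(medoid);
            if (score != null && score > bestScore) {
                bestScore = score;
                best = medoid;
            }
        }
        return best;
    }
}

The update step is where I get stuck: there's no obvious "mean" of a
set of documents, and with medoids it would mean picking the member
most similar to the rest of its cluster, which sounds expensive at
this scale.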
I guess the alternative would be to step back and produce a
documents x terms matrix:
     |  Ta  |  Tb  |  Tc
-----+------+------+------
 Da  |      |  0.1 |  0.4
 Db  |      |      |  0.5
And then cluster based on that? This seems similar in structure to the
user x movie matrix that's often used in recommendation examples?
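In that case the per-document map step would just be tokenising each
document into a sparse term-count vector, something like the sketch
below (names are illustrative, and the output here is a plain text
serialisation, whereas presumably it would really need to be whatever
vector format the clustering jobs expect):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Turns one input line ("docId<TAB>document text") into a sparse term-count
// vector serialised as "term:count term:count ...". Illustrative only; the
// point is just that each document becomes a point in term space, which is
// the shape of input the k-means/canopy examples seem to expect.
public class TermVectorMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) {
            return; // skip malformed lines
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String term : parts[1].toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        StringBuilder vector = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            vector.append(entry.getKey()).append(':').append(entry.getValue()).append(' ');
        }
        context.write(new Text(parts[0]), new Text(vector.toString().trim()));
    }
}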
I'd really appreciate people's thoughts: am I thinking about this the
right way? Is there a better way?
Thanks,
Paul