On May 28, 2009, at 7:25 PM, Ted Dunning wrote:

> I don't think that the clustering stuff will do the tf-idf weighting or

Right, that I wasn't expecting from Mahout (yet)

> cosine norm.

o.a.mahout.utils.CosineDistanceMeasure?

> From there, any of the clustering algorithms should be happy.

Cool. Whew.

> (that is, what you said is just right)
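For reference, cosine distance on term-weight vectors is one minus the cosine similarity; on L2-normalized vectors that is just one minus the dot product. A minimal standalone sketch of the idea (not the actual o.a.mahout.utils.CosineDistanceMeasure source, which presumably operates on Mahout's Vector type):

// Illustrative only: cosine distance between two dense term-weight vectors.
public final class CosineDistanceSketch {

  public static double distance(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) {
      return 1.0; // treat an all-zero vector as maximally distant
    }
    // For already L2-normalized vectors this reduces to 1 - dot.
    return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}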
On Thu, May 28, 2009 at 12:21 PM, Grant Ingersoll <[email protected]> wrote:

Isn't this what Mahout's clustering stuff will do? In other words, if I calculate the vector for each document (presumably removing stopwords), normalize it, where each cell is the weight (presumably TF/IDF), and then put that into a matrix (keeping track of labels), I should then be able to just run any of Mahout's clustering jobs on that matrix using the appropriate DistanceMeasure implementation, right? Or am I missing something?
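A rough sketch of the input shape described in that message, using plain Java collections as stand-ins for Mahout's actual Vector/Matrix/DistanceMeasure types (all class and method names below are hypothetical):

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a labeled "matrix" of normalized document vectors plus a
// pluggable distance, standing in for Mahout's own types.
public class LabeledDocMatrix {

  // Pluggable distance, analogous in spirit to a DistanceMeasure implementation.
  public interface Distance {
    double distance(double[] a, double[] b);
  }

  // docLabel -> L2-normalized term-weight vector (one row of the matrix)
  private final Map<String, double[]> rows = new LinkedHashMap<>();

  public void addRow(String label, double[] weights) {
    rows.put(label, l2Normalize(weights));
  }

  private static double[] l2Normalize(double[] v) {
    double sumSq = 0.0;
    for (double x : v) {
      sumSq += x * x;
    }
    double norm = Math.sqrt(sumSq);
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = norm == 0.0 ? 0.0 : v[i] / norm;
    }
    return out;
  }

  // A clustering job would evaluate something like this for every
  // vector/centroid pair using whichever measure is plugged in.
  public double rowDistance(String label1, String label2, Distance measure) {
    return measure.distance(rows.get(label1), rows.get(label2));
  }
}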
On May 28, 2009, at 11:55 AM, Ted Dunning wrote:

Generally the first step for document clustering is to compute all non-trivial document-document similarities. A good way to do that is to strip out kill words from all documents and then do a document-level cross-occurrence. In database terms, if we think of documents as (docid, term) pairs, this step consists of joining this document table to itself to get document-document pairs for all documents that share terms. In detail, starting with a term weight table and a document table:

- join term weight to document table to get (docid, term, weight)*
- optionally normalize term weights per document by summing weights or squared weights by docid and joining back to the weighted document table.
- join the result to itself, dropping terms and reducing on docid to sum weights. This gives (docid1, docid2, sum_of_weights, number_of_occurrences). The sum can be of weights or squared weights. Accumulating the number of co-occurrences helps in computing the average.

From here, there are a number of places to go, but the result we have here is essentially a sparse similarity matrix. If you have document normalization, then document similarity can be converted to distance trivially.
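A rough, non-MapReduce sketch of the self-join step described above, assuming in-memory maps in place of the database tables; here the per-pair sum accumulates the product of the two documents' weights (the dot-product flavor), but summing raw or squared weights as mentioned works the same way:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: builds a sparse (docid1, docid2) -> {sum_of_weights, count}
// table from per-document term weights, mirroring the join described above.
public class SimilarityJoin {

  public static Map<String, double[]> join(Map<String, Map<String, Double>> docTermWeights) {
    // Invert the (docid, term, weight) table to term -> list of (docid, weight).
    Map<String, List<Map.Entry<String, Double>>> postings = new HashMap<>();
    for (Map.Entry<String, Map<String, Double>> doc : docTermWeights.entrySet()) {
      for (Map.Entry<String, Double> tw : doc.getValue().entrySet()) {
        postings.computeIfAbsent(tw.getKey(), t -> new ArrayList<>())
                .add(Map.entry(doc.getKey(), tw.getValue()));
      }
    }
    // "Join the result to itself, dropping terms and reducing on docid":
    // every pair of documents sharing a term contributes its weight product
    // to the pair's running sum and 1 to its co-occurrence count.
    Map<String, double[]> pairs = new HashMap<>(); // "docid1|docid2" -> {sum, count}
    for (List<Map.Entry<String, Double>> docs : postings.values()) {
      for (int i = 0; i < docs.size(); i++) {
        for (int j = i + 1; j < docs.size(); j++) {
          String d1 = docs.get(i).getKey();
          String d2 = docs.get(j).getKey();
          String key = d1.compareTo(d2) < 0 ? d1 + "|" + d2 : d2 + "|" + d1;
          double[] acc = pairs.computeIfAbsent(key, k -> new double[2]);
          acc[0] += docs.get(i).getValue() * docs.get(j).getValue();
          acc[1] += 1.0;
        }
      }
    }
    return pairs;
  }
}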
On Thu, May 28, 2009 at 8:28 AM, Grant Ingersoll <[email protected]> wrote:

It sounds like a start. Can you open a JIRA and attach a patch? I still am not sure if Lucene is totally the way to go on it. I suppose eventually we need a way to put things in a common format like ARFF and then just have transformers to it from other formats. Come to think of it, maybe it makes sense to have a Tika ContentHandler that can output ARFF or whatever other format we want. This would make translating input docs dead simple.

Then again, maybe a real Pipeline is the answer. I know Solr, etc. could benefit from one too, but that is a whole different ball of wax.
On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:

Hi Grant,

I have the code to create a Lucene index from document text and then generate document vectors from it. This is stand-alone code and not MR. Is it something that interests you?

--shashi
On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll <[email protected]> wrote:

I'm about to write some code to prepare docs for clustering and I know at least a few others on the list here have done the same. I was wondering if anyone is in the position to share their code and contribute to Mahout.

As I see it, we need to be able to take in text and create the matrix of terms, where each cell is the TF/IDF (or some other weight, would be nice to be pluggable) and then normalize the vector (and, according to Ted, we should support using different norms). Seems like we also need the label stuff in place (https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not sure on the state of that patch.

As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver, but it needs to be more generic. I realize we could use Lucene, but having a solution that scales w/ Lucene is going to take work, AIUI, whereas an M/R job seems more straightforward.
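A minimal sketch of what a pluggable weighting scheme could look like; the interface and class names here are hypothetical, not the existing BayesTfIdfDriver API:

// Sketch only: pluggable term weighting; names are illustrative, not an existing Mahout API.
public interface WeightingScheme {

  /**
   * @param termFreq raw count of the term in the document
   * @param docFreq  number of documents containing the term
   * @param numDocs  total number of documents in the collection
   * @return the weight to store in that document's vector cell
   */
  double weight(int termFreq, int docFreq, int numDocs);
}

// Classic TF/IDF: tf * log(N / df). Other schemes plug in without touching
// the vectorization or normalization code.
class TfIdfWeighting implements WeightingScheme {
  public double weight(int termFreq, int docFreq, int numDocs) {
    if (docFreq == 0) {
      return 0.0;
    }
    return termFreq * Math.log((double) numDocs / docFreq);
  }
}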
I'd like to be able to get this stuff committed relatively soon and have the examples for other people. My shorter-term goal is some demos I'm working on using Wikipedia.
Thanks,
Grant