Thanks, Jake! It is much, much better now :)

I think I can do that pipeline now. About filtering (for me it is all the
same whether I have to filter unigrams or n-grams - the filtering is the
key part): I built a small web interface (Tomcat, MyFaces, MBeans, JSP,
etc.) that shows the clusters (with a list of top terms), and each top term
is a link - clicking it adds the term to the Solr stopwords.txt file. I can
also define regexes in that interface, which get stored in a stopregex.txt
(which is of course used only by the Solr/Lucene index-to-sparse-vector
Driver, which I had to change, but not by Solr itself). I can then trigger
k-means or clusterDump (with different parameters - like the number of top
terms to be displayed) from within the same web UI. I also have a page
which shows all the docs in a cluster (I am extracting the original docs
per cluster) and highlights the top terms in them in different colors - all
this just to be able to better train the stopwording.
I wanted to say that I found this stop-word / stop-regex filtering so
important that I had to build this web UI - to have an easy-to-use
stop-word trainer. I am even considering adding some stops based on Lucene
fields - e.g. I want to stop entire docs if certain conditions are met.
(I am analyzing emails and, as you can imagine, some emails should be kept
out of clustering just based on their sender or subject, etc. - some basic
spam filtering.)
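
Roughly, the check my changed Driver applies looks like this (a simplified
sketch - the method and variable names are made up, only the idea is real):

  import java.util.List;
  import java.util.Set;
  import java.util.regex.Pattern;

  // Decide whether a term should be dropped while converting the Lucene
  // index into sparse vectors: drop it if it is listed in stopwords.txt
  // or if it matches one of the patterns from stopregex.txt.
  boolean isStopped(String term, Set<String> stopwords,
                    List<Pattern> stopRegexes) {
    if (stopwords.contains(term)) {
      return true;
    }
    for (Pattern p : stopRegexes) {
      if (p.matcher(term).matches()) {
        return true;
      }
    }
    return false;
  }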

You also mentioned that you are doing stopwording in Mahout?
In which version, and where is that? I added stopwords to the Solr
configuration and it worked like a charm for Solr itself, but it never
worked for Mahout (the Lucene vector Driver), so I had to change that
Driver to apply the stopwords and regexes. Did I duplicate something that
already exists? I am using Mahout v0.2 (not trunk yet).

Best regards,
Bogdan

On Thu, Jan 7, 2010 at 4:24 AM, Jake Mannix <[email protected]> wrote:

> Hi Bogdan,
>
>  Sorry about that, I kinda jumped into implementation details related to
> what we've been discussing on the mahout-dev list - much of what I
> described we don't yet have in Mahout, and I was talking about some of
> the ideas on how we can incorporate it in!
>
> On Wed, Jan 6, 2010 at 5:53 PM, Bogdan Vatkov <[email protected]> wrote:
>
> > Ok, to be honest I do not get all of this - still a newbie in this matter.
> > Have some clarification questions:
> >
> > > The way I've done this is to take whatever unigram analyzer for
> > > tokenization that fits what you want to do, wrap it in Lucene's
> > > ShingleAnalyzer, and use that as the "tokenizer" (which now
> > > produces ngram tokens as single tokens each),
> >
> >
> > you mean to use the unigram analyzer only to feed the ShingleAnalyzer
> > (as part of Lucene?), so the first one is not natively connected to
> > Lucene directly but only produces input for Lucene, so to say?
> >
>
> Mahout has the Lucene jars pulled in, and we use them for things like
> analyzing / tokenizing text.  For example, included in the Lucene jars we
> link to is the StandardAnalyzer, which does tokenization, lower-casing,
> stop-word removal, and some cleanup.  It takes in your text and spits out
> a stream of tokens, which for most Analyzer implementations are unigrams.
> Also available in Lucene is the ShingleAnalyzerWrapper, which wraps
> another Analyzer, so you can do things like
>
>  Analyzer myAnalyzer = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 3);
>
> (where "3" is just saying you only want up to 3-grams)
>
> What you have at this point is an Analyzer instance which takes in raw
> Reader instances (like StringReader), and spits out a stream of ngrams.
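>
> For example, you can drain that stream like so (this is the Lucene
> 2.9-era attribute API - the cast is unnecessary on Lucene 3.0+, and
> exception handling is omitted):
>
>   import java.io.StringReader;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.tokenattributes.TermAttribute;
>
>   // Run raw text through the chained analyzer and print every token
>   // it emits - unigrams, bigrams and trigrams alike.
>   TokenStream stream =
>       myAnalyzer.tokenStream("body", new StringReader("the quick brown fox"));
>   TermAttribute term = (TermAttribute) stream.addAttribute(TermAttribute.class);
>   while (stream.incrementToken()) {
>     System.out.println(term.term());   // e.g. "quick", "quick brown", ...
>   }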
>
>
> >
> >
> > > and run that through the LLR ngram M/R job (which ends by sorting
> > > descending by LLR score), and shove the top-K ngrams (and sometimes
> > > the unigrams which fit some "good" IDF range) into a big bloom filter,
> > > which is serialized and saved.
> > >
> >
> > what are IDF and LLR?
>
>
> IDF is "Inverse Document Frequency", a weighting scheme used in Lucene
> and elsewhere in IR and NLP which helps boost the effective weight of
> terms which are more rare (numerically, the IDF of a term which occurs in
> k out of N documents is usually taken to be something logarithmic like
> log( N / k )).
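> For example, a term occurring in 1,000 out of 1,000,000 documents gets
> IDF = log( 1000000 / 1000 ) = log( 1000 ), about 6.9, while a term which
> occurs in every other document gets only log( 2 ), about 0.69.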
>
> LLR is the "Log-Likelihood Ratio", which for token n-grams is a measure
> of how statistically significant each n-gram is (i.e. if tokens "A" and
> "B" occur with frequencies f(A) and f(B), the LLR of the bigram "A B" is
> a measure of how much more frequently "A B" occurs than it would if the
> two tokens were uncorrelated).  In some ways you can look at IDF as LLR
> for unigrams.
>
>
> > do we have that in Mahout? I am only using the k-means algorithm from
> > Mahout. How would you relate these? I lack some terminology, though.
> >
>
> What I described is a pipeline that takes text, tokenizes it, and finds
> the top statistically significant ngrams (you don't want all of them -
> most are either fairly rare or are just the combination of a common
> unigram and another unigram), and then makes Mahout SparseVector
> instances out of the documents, which you can then feed to k-means.
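>
> A rough sketch of that last step (the helper name is made up, and it
> assumes a precomputed dictionary mapping each "good" ngram to a vector
> index, plus the SparseVector class from Mahout 0.2):
>
>   import java.io.IOException;
>   import java.io.StringReader;
>   import java.util.Map;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.tokenattributes.TermAttribute;
>   import org.apache.mahout.matrix.SparseVector;
>
>   // Turn one document into a term-frequency vector, counting only
>   // the ngrams which survived the LLR / IDF filtering.
>   SparseVector vectorize(String text, Map<String, Integer> dictionary)
>       throws IOException {
>     SparseVector vector = new SparseVector(dictionary.size());
>     TokenStream stream =
>         myAnalyzer.tokenStream("body", new StringReader(text));
>     TermAttribute term =
>         (TermAttribute) stream.addAttribute(TermAttribute.class);
>     while (stream.incrementToken()) {
>       Integer index = dictionary.get(term.term());
>       if (index != null) {    // only count the "good" ngrams
>         vector.set(index, vector.get(index) + 1.0);
>       }
>     }
>     return vector;
>   }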
>
> > and I could possibly do that one - write some small algorithm which
> > finds n-grams and feeds the initial term vectors before the k-means
> > task. Is what you explained close to what Ted suggested, or are they
> > different approaches?
> >
>
> I was basically describing what Ted was talking about, just in a bit more
> intricate detail.  Wrap a simple ShingleAnalyzerWrapper around a
> StandardAnalyzer from Lucene and you've got your "little algorithm"
> to pull out ngrams which can be fed to k-means.  You will just get a lot
> of noise if you don't do a little bit of work to make sure you only have
> the "good" ngrams.
>
> I would really like to have this whole pipeline in Mahout, in a totally
> user-friendly way, but while the pieces exist, the whole pipeline hasn't
> been put together quite yet (although you could certainly write it
> yourself - we love patches! :) )
>
>  -jake
>



