Nice trick, I'll give it a try. N-grams (and I'd argue even the IDF part of TF-IDF) have frequency scores that are highly dependent on the corpus, so eyeballing the results seems like a good idea. The wider-ranging the subject matter, the lower the llr scores will be in general.

In my case the corpus isn't truly large but is quite diverse, nothing like the Reuters data. So applying llr rules of thumb may not work.

On 6/11/12 11:11 PM, Ted Dunning wrote:
Actually, most of these bigrams are quite plausible for clustering.  I wouldn't 
worry too much about the linguistic plausibility of these pairs. The question 
is scoring performance.

Sent from my iPhone

On Jun 11, 2012, at 7:31 PM, Drew Farris <d...@apache.org> wrote:

Pat,

For what it's worth, in many cases the n-grams with the highest llr
scores tend to be kinda cruddy too. For example, here are the top few
from the reuters data set after tokenization in preparation for
k-means clustering.

reuter 3    203110.22877580073
mar 1987    108503.63631130551
apr 1987    51114.50167048405
mln dlrs    47316.0804096169
cts vs      24295.63024604626
jun 1987    22770.96320381516
he said     22630.6717728317

In these cases, both of the terms are high-frequency terms. If
stopwords hadn't been removed, we would have seen a large number of
them at the top of the list as well. This leads me to believe that
there's some sort of max llr that would be useful as a cutoff, which
we should consider adding to seq2sparse. I suspect such a cutoff would
vary based on corpus vocabulary size.
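
As a quick experiment in that direction, something like the following
filters the dumped list (see [1] below) by a hypothetical max cutoff as
well as a min. The cutoff values here are made up, and the awk assumes
the "word1 word2<TAB>score" lines that the sed in [1] produces:

# Keep only n-grams whose llr score falls between a min and a hypothetical max cutoff.
# min/max are illustrative only; field 3 is the score column in the 'out' file from [1].
awk -v min=10 -v max=10000 '$3 >= min && $3 <= max' out > ngrams-kept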

In the past, I've settled on min thresholds by inspection. I dump the
generated n-grams[1] and their llr scores and inspect the quality of
the n-grams at each order-of-magnitude change in score. Looking at reuters,
I might choose a min llr in the 10s as opposed to the 100s or 1000s.

In general, you should find n-grams that don't make sense from a
linguistic perspective mixed in with others that are wonderful.
Eventually the crud overwhelms the good, and there's your cutoff.
There's likely a more statistically sound approach to this, but I
haven't found it yet.

And despite the crud, it's certainly more effective than indexing all pairs.

[1] e.g.:
./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
  -i /tmp/mahout-work-drew/reuters-out-seqdir-sparse-kmeans/tokenized-documents \
  -o /tmp/mahout-work-drew/reuters-out-seqdir-sparse-kmeans/colloc \
  -ng 2 -s 2 -ml 0.5
./bin/mahout seqdumper \
  -i /tmp/mahout-work-drew/reuters-out-seqdir-sparse-kmeans/colloc/ngrams/part-r-00000 \
  | grep '^Key' | sed -e 's/Key: //; s/: Value: /\t/' | sort -k 3 -rn > out
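
To eyeball where the quality drops off, a rough bucket count over that
same out file does the trick (a sketch; field 3 is the score column
after the sed above, and the second line just spot-checks one bucket):

# Count n-grams per order of magnitude of llr score, then spot-check a bucket.
awk '{ print int(log($3)/log(10)) }' out | sort -n | uniq -c
awk '$3 >= 10 && $3 < 100' out | head -20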

Drew

On Sat, Jun 9, 2012 at 6:03 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
OK, thanks. I'm trying to find reasonable ways to reduce dimensionality
before proceeding to more heavyweight methods.

So my understanding of seq2sparse n-grams seems to be correct: I don't want
many. Set to 200 I get some nonsensical ones; maybe 2000 is too high. I
think MiA mentions 1000 as a pretty high value.

As to df pruning, I thought x = 40 meant that if a term appeared in more
than 40% of the docs it was removed. For my 150,000-page crawl that didn't
seem like an unreasonable number. If intuition says differently, what
would be a good value? Maybe I should use maxDFSigma instead, set to
3.0 as the help suggests?
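
For reference, the sort of run I have in mind looks roughly like the
sketch below (flags from memory of the seq2sparse help, placeholder
paths; I believe --maxDFSigma takes the place of -x when both are given,
but check the help on your version):

# Sketch only: placeholder paths, flags as I recall them from seq2sparse --help.
./bin/mahout seq2sparse \
  -i /path/to/crawl-seqdir \
  -o /path/to/crawl-sparse \
  -ng 2 -ml 200 \
  --maxDFSigma 3.0 \
  -wt tfidf -nv
# (or: -x 40 to prune terms appearing in more than 40% of the docs)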


On 6/9/12 11:39 AM, Robin Anil wrote:


On Sat, Jun 9, 2012 at 10:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

As I understand it, using seq2sparse with ng = 2 and ml = some large
number will never create a vector with fewer terms than words (all other
parts of the algorithm set aside). In other words, ng = 2 and ml = 2000
will create very few n-grams but will never create a zero-length vector
unless there are no terms to begin with.

Is this correct?

I ask because it looks like many of my n-grams are not really helpful, so
I keep tuning ml upwards, but Robin made a comment that this might cause
zero-length vectors, in which case I might want to stop using n-grams.

You didn't quite get me.
I meant ml = minimum log-likelihood threshold. A bigram with a
log-likelihood of 1.0 is quite a significant n-gram; if you set ml > 2000,
there might not be any n-gram with such a score. Secondly, the df pruning
at 40% along with the ml threshold of 200 is creating vectors in your
dataset devoid of features, i.e. empty vectors.
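
A quick way to check is to dump the tfidf vectors and count the empty
ones, something like this sketch (it assumes seqdumper renders an empty
vector as "{}", which is worth verifying on your build, and that
tfidf-vectors is the directory under your seq2sparse output):

# Rough check for feature-less vectors after pruning.
# Assumes empty vectors show up as "{}" in the seqdumper output.
./bin/mahout seqdumper -i /path/to/output/tfidf-vectors/part-r-00000 \
  | grep '^Key' | grep -c '{}'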

