Pat,

For what it's worth, in many cases the n-grams with the highest llr
scores tend to be kinda cruddy too. For example, here are the top few
from the reuters data set after tokenization in preparation for
k-means clustering.

reuter 3        203110.22877580073
mar 1987        108503.63631130551
apr 1987        51114.50167048405
mln dlrs        47316.0804096169
cts vs  24295.63024604626
jun 1987        22770.96320381516
he said 22630.6717728317

In these cases, both terms in the bigram are high-frequency terms. If
stopwords hadn't been removed, we would have seen a large number of
them at the top of the list as well. This leads me to believe there's
some sort of max llr cutoff that would be useful here, and that we
should consider adding it to seq2sparse. I suspect such a cutoff would
vary with the corpus vocabulary size.
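
If it helps to see why the scores land where they do: as far as I recall, the
collocation scorer uses Dunning's raw G^2 (log likelihood ratio) over the 2x2
contingency table of bigram/head/tail counts, so the score grows with the
co-occurrence count itself. Here's a rough back-of-the-envelope sketch of that
formula (llr below is just a throwaway shell helper, and the counts are made
up, not taken from reuters):

# llr k11 k12 k21 k22 -- Dunning's G^2 for a 2x2 contingency table, where
# k11 = count(A B), k12 = count(A, not B), k21 = count(not A, B), k22 = the rest
llr() {
  awk -v k11="$1" -v k12="$2" -v k21="$3" -v k22="$4" '
    function xlx(x) { return x > 0 ? x * log(x) : 0 }
    BEGIN {
      n     = k11 + k12 + k21 + k22
      rows  = xlx(k11 + k12) + xlx(k21 + k22)
      cols  = xlx(k11 + k21) + xlx(k12 + k22)
      cells = xlx(k11) + xlx(k12) + xlx(k21) + xlx(k22)
      printf "%.1f\n", 2 * (xlx(n) + cells - rows - cols)
    }'
}

# rare but crisp collocation: 20 of 25 occurrences of each word are together
llr 20 5 5 999970             # roughly a few hundred

# two ubiquitous words that co-occur a lot simply because they're everywhere
llr 8000 12000 12000 968000   # tens of thousands

The second case is the "he said" pattern above: nothing linguistically
interesting, just two very frequent terms, but the score dwarfs the rare pair.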

In the past, I've settled on min thresholds by inspection: I dump the
generated ngrams and their llr scores[1] and inspect the quality of the
ngrams at each order-of-magnitude change in score. Looking at reuters,
I might choose a min llr in the 10's as opposed to the 100's or 1000's.

In general, you should find that there tend to be ngrams that don't
make sense from a linguistic perspective mixed in with others that are
wonderful. Eventually the crud overwhelms the good, and there's your
cutoff. There's likely a more statistically sound approach to this,
but I haven't found it yet.

And despite the crud, it's certainly more effective than indexing all pairs.
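
To put a rough number on that: once you have the dump from [1] below (one
ngram<TAB>score per line), you can count how many bigrams survive a few
candidate -ml settings versus the total number of distinct pairs, something
like:

total=$(wc -l < out)
for t in 1 10 100 1000; do
  kept=$(awk -F'\t' -v t="$t" '$2 >= t' out | wc -l)
  echo "ml >= $t keeps $kept of $total ngrams"
done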

[1] e.g.:
./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
  -i /tmp/mahout-work-drew/reuters-out-seqdir-sparse-kmeans/tokenized-documents \
  -o /tmp/mahout-work-drew/reuters-out-seqdir-sparse-kmeans/colloc \
  -ng 2 -s 2 -ml 0.5

./bin/mahout seqdumper \
  -i /tmp/mahout-work-drew/reuters-out-seqdir-sparse-kmeans/colloc/ngrams/part-r-00000 \
  | grep '^Key' | sed -e 's/Key: //; s/: Value: /\t/' | sort -k 3 -rn > out
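
Assuming out has that ngram<TAB>score layout, a couple of quick awk passes
make the order-of-magnitude inspection less tedious:

# how many ngrams land in each order of magnitude of llr score
awk -F'\t' '$2 >= 1 { print int(log($2) / log(10)) }' out | sort -n | uniq -c

# eyeball the quality within a single decade, e.g. scores in [10, 100)
awk -F'\t' '$2 >= 10 && $2 < 100' out | head -25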

Drew

On Sat, Jun 9, 2012 at 6:03 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> OK, thanks. I'm trying to find ways to reduce dimensionality in some
> reasonable way before proceeding to more heavyweight methods.
>
> So my understanding of seq2sparse n-grams seems to be correct. I don't want
> many. Set to 200 I get some nonsensical ones; maybe 2000 is too high. I
> think MiA mentions 1000 as a pretty high value.
>
> As to df pruning, I thought x = 40 meant that if a term appeared in more
> than 40% of the docs it was removed. For my 150,000 page crawl it didn't
> seem like an unreasonable number. If the intuition says differently what
> would be a good number? Maybe I should use maxDFSigma instead - maybe set to
> 3.0 as the help suggests?
>
>
> On 6/9/12 11:39 AM, Robin Anil wrote:
>>
>> ------
>> Robin Anil
>>
>>
>> On Sat, Jun 9, 2012 at 10:27 AM, Pat Ferrel<p...@occamsmachete.com>  wrote:
>>
>>> As I understand it, when using seq2sparse with ng = 2 and ml = some large
>>> number, this will never create a vector with fewer terms than words (all
>>> other parts of the algorithm set aside). In other words, ng = 2 and ml = 2000
>>> will create very few n-grams but will never create a 0-length vector unless
>>> there are no terms to begin with.
>>>
>>> Is this correct?
>>>
>>> I ask because it looks like many of my n-grams are not really helpful, so I
>>> keep tuning the ml upwards, but Robin made a comment that this might cause
>>> 0-length vectors, in which case I might want to stop using n-grams.
>>>
>> You didn't quite get me.
>> I meant ml = minimum log likelihood threshold. A bigram with a log-likelihood
>> of 1.0 is quite a significant ngram. If you set ml > 2000, there might not be
>> any ngram that has such a score. Secondly, df pruning of 40% along with an
>> ml > 200 threshold is creating vectors in your dataset devoid of features,
>> i.e. empty vectors.
>>
>
