Part-of-speech tagging lets you spot word pairs that are interesting.
For example, you can choose only "noun-verb" n-grams. I'm adding
OpenNLP to another project (Lucene) and the part-of-speech tagging
seems to work well.
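
As a rough sketch of the kind of filtering I mean (not the actual Lucene
integration; the model file name and the simple adjacency check are just
assumptions), something like this with OpenNLP's POSTaggerME keeps only
noun-followed-by-verb pairs:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class NounVerbBigrams {
  public static void main(String[] args) throws Exception {
    // Load the stock English maxent POS model (assumed to be on disk).
    try (InputStream in = new FileInputStream("en-pos-maxent.bin")) {
      POSTaggerME tagger = new POSTaggerME(new POSModel(in));
      String[] tokens = "the board approved the merger yesterday".split(" ");
      String[] tags = tagger.tag(tokens); // Penn Treebank tags: DT, NN, VBD, ...
      for (int i = 0; i < tokens.length - 1; i++) {
        // Keep only adjacent pairs where a noun is followed by a verb,
        // e.g. "board approved" in the sample sentence above.
        if (tags[i].startsWith("NN") && tags[i + 1].startsWith("VB")) {
          System.out.println(tokens[i] + " " + tokens[i + 1]);
        }
      }
    }
  }
}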

On Mon, Jun 11, 2012 at 7:31 PM, Drew Farris <d...@apache.org> wrote:
> Pat,
>
> For what it's worth, in many cases the n-grams with the highest llr
> scores tend to be kinda cruddy too. For example, here are the top few
> from the Reuters data set after tokenization in preparation for
> k-means clustering.
>
> reuter 3        203110.22877580073
> mar 1987        108503.63631130551
> apr 1987        51114.50167048405
> mln dlrs        47316.0804096169
> cts vs  24295.63024604626
> jun 1987        22770.96320381516
> he said 22630.6717728317
>
> In these cases, both of the terms are high-frequency terms. If
> stopwords hadn't been removed, we would have seen a large number of
> them at the top of the list as well. This leads me to believe that
> there's some sort of max llr that would be useful as a cutoff, which we
> should consider adding to seq2sparse. I suspect such a cutoff would
> vary based on corpus vocabulary size.
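>
> As a rough illustration of why frequency matters so much (the counts
> below are made up, not taken from the Reuters run, and I'm assuming the
> LogLikelihood helper in mahout-math), a bigram of two common terms with
> only a modest lift over chance still scores far above a min llr in the
> 10s:
>
> import org.apache.mahout.math.stats.LogLikelihood;
>
> public class LlrExample {
>   public static void main(String[] args) {
>     // Contingency counts for a candidate bigram "A B" (made-up numbers):
>     // k11 = count(A then B), k12 = A then something else,
>     // k21 = something else then B, k22 = all remaining bigrams.
>     // Under independence we'd expect k11 to be about 3100, so the
>     // observed 4000 is only a ~1.3x lift, yet the score comes out in
>     // the hundreds because every cell of the table is large.
>     int k11 = 4000, k12 = 46000, k21 = 58000, k22 = 892000;
>     System.out.println(LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22));
>   }
> }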
>
> In the past, I've settled on min thresholds by inspection: I dump the
> generated n-grams[1] and their llr scores and inspect the quality of the
> n-grams at each order-of-magnitude change in score. Looking at Reuters,
> I might choose a min llr in the 10s as opposed to the 100s or 1000s.
>
> In general, you'll find n-grams that don't make sense from a linguistic
> perspective mixed in with others that are wonderful. Eventually, the
> crud overwhelms the good and there's your cutoff. There's likely a more
> statistically sound approach to this, but I haven't found it yet.
>
> And despite the crud, it's certainly more effective than indexing all pairs.
>
> [1] e.g.:
> ./bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
>   -i /tmp/mahout-work-drew/reuters-out-seqdir-sparse-kmeans/tokenized-documents \
>   -o /tmp/mahout-work-drew/reuters-out-seqdir-sparse-kmeans/colloc \
>   -ng 2 -s 2 -ml 0.5
> ./bin/mahout seqdumper \
>   -i /tmp/mahout-work-drew/reuters-out-seqdir-sparse-kmeans/colloc/ngrams/part-r-00000 \
>   | grep '^Key' | sed -e 's/Key: //; s/: Value: /\t/' | sort -k 3 -rn > out
>
> Drew
>
> On Sat, Jun 9, 2012 at 6:03 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> OK, thanks. I'm trying to find reasonable ways to reduce dimensionality
>> before proceeding to more heavyweight methods.
>>
>> So my understanding of seq2sparse n-grams seems to be correct. I don't want
>> many. With ml set to 200 I get some nonsensical ones; maybe 2000 is too
>> high. I think MiA mentions 1000 as a pretty high value.
>>
>> As to df pruning, I thought x = 40 meant that if a term appeared in more
>> than 40% of the docs it was removed. For my 150,000-page crawl that didn't
>> seem like an unreasonable number. If intuition says otherwise, what would
>> be a good number? Maybe I should use maxDFSigma instead, maybe set to 3.0
>> as the help suggests?
>>
>>
>> On 6/9/12 11:39 AM, Robin Anil wrote:
>>>
>>> ------
>>> Robin Anil
>>>
>>>
>>> On Sat, Jun 9, 2012 at 10:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>
>>>> As I understand it, when using seq2sparse with ng = 2 and ml = some large
>>>> number, this will never create a vector with fewer terms than words (all
>>>> other parts of the algorithm set aside). In other words, ng = 2 and
>>>> ml = 2000 will create very few n-grams but will never create a
>>>> zero-length vector unless there are no terms to begin with.
>>>>
>>>> Is this correct?
>>>>
>>>> I ask because it looks like many of my n-grams are not really helpful,
>>>> so I keep tuning ml upwards, but Robin made a comment that this might
>>>> cause zero-length vectors, in which case I might want to stop using
>>>> n-grams.
>>>>
>>>>
>>> You didn't quite get me. I meant ml = minimum log-likelihood threshold. A
>>> bigram with a log-likelihood of 1.0 is quite a significant n-gram; if you
>>> set ml > 2000, there might not be any n-gram with such a score. Secondly,
>>> df pruning at 40% along with an ml > 200 threshold is creating vectors in
>>> your dataset that are devoid of features, i.e., empty vectors.
>>>
>>



-- 
Lance Norskog
goks...@gmail.com
