I used the OpenNLP Parts-Of-Speech tool to label all words as 'noun',
'verb', etc. I removed all words that were not nouns or verbs. In my
use case, this is a total win. In other cases, maybe not: Twitter has
quite a varied non-grammar.
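
For reference, a minimal sketch of that kind of noun/verb filter, using the
OpenNLP 1.5-style POSTaggerME API; the model path and sample sentence are
placeholders, and the tags are Penn Treebank, so the NN*/VB* prefixes cover
nouns and verbs:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.WhitespaceTokenizer;

    public class NounVerbFilter {
        public static void main(String[] args) throws Exception {
            // Pre-trained English maxent POS model; the path is a placeholder.
            InputStream in = new FileInputStream("en-pos-maxent.bin");
            POSTaggerME tagger = new POSTaggerME(new POSModel(in));
            in.close();

            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(
                "The can can hold water");
            String[] tags = tagger.tag(tokens);

            // Keep only nouns (NN*) and verbs (VB*); drop everything else.
            List<String> kept = new ArrayList<String>();
            for (int i = 0; i < tokens.length; i++) {
                if (tags[i].startsWith("NN") || tags[i].startsWith("VB")) {
                    kept.add(tokens[i]);
                }
            }
            System.out.println(kept);
        }
    }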

On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel <p...@farfetchers.com> wrote:
> The way back from stem to tag is interesting from the standpoint of making 
> tags human readable. I had assumed a lookup but this seems much more 
> satisfying and flexible. In order to keep frequencies it will take something 
> like a dictionary-creation step in the analyzer. This in turn seems to imply
> a join, so a whole new map-reduce job--maybe not completely trivial?
>
> It seems that NLP can be used in two very different ways here. First as a 
> filter (keep only nouns and verbs?), second to differentiate semantics
> (can:verb, can:noun). One method is a dimensional-reduction technique; the
> other increases dimensions but can lead to orthogonal dimensions from the
> same term. I suppose both could be used together as the above example 
> indicates.
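
A small sketch of that second use, assuming tokens[] and tags[] arrays like the
ones produced by the tagger above; the coarse :noun/:verb suffix is purely
illustrative:

    // Turn (token, Penn Treebank tag) pairs into POS-qualified terms so the
    // noun "can" and the verb "can" land in different dimensions.
    static java.util.List<String> posQualify(String[] tokens, String[] tags) {
        java.util.List<String> terms = new java.util.ArrayList<String>();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].startsWith("NN")) {
                terms.add(tokens[i].toLowerCase() + ":noun");  // e.g. "can:noun"
            } else if (tags[i].startsWith("VB")) {
                terms.add(tokens[i].toLowerCase() + ":verb");  // e.g. "can:verb"
            }
        }
        return terms;
    }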
>
> It sounds like you are using it to filter (only?). Can you explain what you
> mean by:
> "One thing came through- parts-of-speech selection for nouns and verbs
> helped 5-10% in every combination of regularizers."
>
>
> On Aug 3, 2012, at 6:31 PM, Lance Norskog <goks...@gmail.com> wrote:
>
> Thanks everyone- I hadn't considered the stem/synonym problem. I have
> code for regularizing a doc/term matrix with tf, binary, log, and
> augmented norm for the cells, and idf, gfidf, entropy, normal (term
> vector), and probabilistic inverse as global term weights. Running any
> of these, and then SVD, on a Reuters article may take 10-20 ms. This
> uses a sentence/term
> matrix for document summarization. After doing all of this, I realized
> that maybe just the regularized matrix was good enough.
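
For concreteness, a rough sketch of one such combination (log cell weighting
times idf global weighting) over a dense doc/term count matrix; this is not
Lance's code, and a real implementation would presumably use sparse vectors:

    // Weight a doc/term count matrix with tf = log(1 + count) per cell and
    // idf = log(numDocs / docFreq) per term column.
    static double[][] logIdf(int[][] counts) {
        int docs = counts.length, terms = counts[0].length;
        double[][] weighted = new double[docs][terms];
        for (int t = 0; t < terms; t++) {
            int df = 0;
            for (int d = 0; d < docs; d++) {
                if (counts[d][t] > 0) df++;
            }
            double idf = (df == 0) ? 0.0 : Math.log((double) docs / df);
            for (int d = 0; d < docs; d++) {
                weighted[d][t] = Math.log(1.0 + counts[d][t]) * idf;
            }
        }
        return weighted;
    }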
>
> One thing came through- parts-of-speech selection for nouns and verbs
> helped 5-10% in every combination of regularizers. All across the
> board. If you want good tags, select your parts of speech!
>
> On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
> <dawid.we...@cs.put.poznan.pl> wrote:
>> I know, I know. :) Just wanted to mention that it could lead to funny
>> results, that's all. There are lots of ways of doing proper form
>> disambiguation, including shallow tagging, which then makes it possible
>> to retrieve correct base forms (lemmas) rather than stems. Stemming is
>> typically good enough (and fast), so your advice was 100% fine.
>>
>> Dawid
>>
>> On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> This is definitely just the first step.  Similar goofs happen with
>>> inappropriate stemming.  For instance, AIDS should not stem to aid.
>>>
>>> A reasonable way to find and classify exceptional cases is to look at
>>> cooccurrence statistics.  The contexts of original forms can be examined to
>>> find cases where there is a clear semantic mismatch between the original
>>> and the set of all forms that stem to the same form.
>>>
>>> But just picking the most common form that is present in the document is a
>>> pretty good step for all that it produces some oddities.  The results are
>>> much better than showing a user the stemmed forms.
>>>
>>> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss 
>>> <dawid.we...@cs.put.poznan.pl> wrote:
>>>
>>>>> Unstemming is pretty simple.  Just build an unstemming dictionary based
>>>>> on seeing what word forms have led to a stemmed form.  Include
>>>>> frequencies.
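
A bare-bones sketch of such a dictionary: count the surface forms observed for
each stem and hand back the most frequent one.

    import java.util.HashMap;
    import java.util.Map;

    public class Unstemmer {
        // stem -> (original surface form -> count)
        private final Map<String, Map<String, Integer>> formsByStem =
            new HashMap<String, Map<String, Integer>>();

        public void observe(String surfaceForm, String stem) {
            Map<String, Integer> forms = formsByStem.get(stem);
            if (forms == null) {
                forms = new HashMap<String, Integer>();
                formsByStem.put(stem, forms);
            }
            Integer count = forms.get(surfaceForm);
            forms.put(surfaceForm, count == null ? 1 : count + 1);
        }

        // Most frequent original form seen for this stem, falling back to the
        // stem itself if nothing was observed.
        public String unstem(String stem) {
            Map<String, Integer> forms = formsByStem.get(stem);
            if (forms == null) return stem;
            String best = stem;
            int bestCount = -1;
            for (Map.Entry<String, Integer> e : forms.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best;
        }
    }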
>>>>
>>>> This can lead to very funny (or not, depending on how you look at it)
>>>> mistakes when different lemmas stem to the same token. How frequent
>>>> and important this phenomenon is varies from language to language (and
>>>> can be calculated a priori).
>>>>
>>>> Dawid
>>>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Lance Norskog
goks...@gmail.com
