I know, I know. :) Just wanted to mention that it could lead to funny
results, that's all. There are lots of ways of doing proper form
disambiguation, including shallow tagging, which then lets you
retrieve correct base forms (lemmas) rather than stems. Stemming is
typically good enough (and fast), so your advice was 100% fine.
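
For reference, the frequency-based unstemming dictionary from the quoted
messages below could be sketched roughly like this. It is only an
illustration: toy_stem is a hypothetical stand-in for a real stemmer, and
the other names are made up for the example.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer (an assumption, not a real
    # library call): lowercase and strip one trailing "s" from longer words.
    w = word.lower()
    return w[:-1] if len(w) > 3 and w.endswith("s") else w

def build_unstem_dict(tokens, stem=toy_stem):
    # Map each stem to a Counter of the original forms that led to it.
    forms = defaultdict(Counter)
    for tok in tokens:
        forms[stem(tok)][tok] += 1
    return forms

def unstem(stemmed, forms):
    # Return the most frequent original form, or the stem if it is unknown.
    counts = forms.get(stemmed)
    return counts.most_common(1)[0][0] if counts else stemmed

tokens = ["AIDS", "aid", "aids", "aid", "aid"]
forms = build_unstem_dict(tokens)
print(unstem("aid", forms))   # the most frequent form seen for this stem
print(dict(forms["aid"]))     # "AIDS" and "aid" collide on the same stem
```

With the toy stemmer, "AIDS" and "aid" end up under the same stem, which
is exactly the kind of collision that produces the funny results.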

Dawid

On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> This is definitely just the first step.  Similar goofs happen with
> inappropriate stemming.  For instance, AIDS should not stem to aid.
>
> A reasonable way to find and classify exceptional cases is to look at
> cooccurrence statistics.  The contexts of original forms can be examined to
> find cases where there is a clear semantic mismatch between the original
> and the set of all forms that stem to the same form.
>
> But just picking the most common that is present in the document is a
> pretty good step for all that it produces some oddities.  The results are
> much better than showing a user the stemmed forms.
>
> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss
> <dawid.we...@cs.put.poznan.pl> wrote:
>
>> > Unstemming is pretty simple.  Just build an unstemming dictionary
>> > based on seeing what word forms have led to a stemmed form.
>> > Include frequencies.
>>
>> This can lead to very funny (or not, depending on how you look at it)
>> mistakes when different lemmas stem to the same token. How frequent
>> and important this phenomenon is varies from language to language (and
>> can be calculated a priori).
>>
>> Dawid
>>
