Hi Katrin,

I found an issue that is likely to hurt results when an abbreviation
dictionary is used. If you try again with the code from trunk, you may get
better results.
OPENNLP-479 <https://issues.apache.org/jira/browse/OPENNLP-479>

Thank you,
William


On Wed, Feb 15, 2012 at 2:58 PM, Katrin Tomanek <[email protected]>
wrote:
> Hi William,
>
> thanks for sharing your experiences.
>
> I did another test:
> * Default Context Generator
> * Corpus: Genia
> * Variant 1: no abbreviation dictionary
> * Variant 2: big abbreviation dictionary of ~1000 entries
> * Variant 3: small abbreviation dictionary of only common and well-known
> abbreviations (15 entries)
>
> here's what I get
> * Variant 1 (F: 0.9910290237467019)
> * Variant 2 (F: 0.9907676074914271)
> * Variant 3 (F: 0.9910290237467019)
>
> --> so for me, using an abbreviation dictionary does not help (at least not
> in evaluation).
>
> However, once my users start reporting recurring problems with
> abbreviations, I might start maintaining an abbreviation dictionary to
> handle those perhaps rare but annoying cases...
>
>
> Cheers
> Katrin
>
>
> On 02/15/2012 05:46 PM, [email protected] wrote:
>>
>> I performed a few experiments with two Portuguese corpora. All tests used
>> MAXENT with 100 iterations and a cutoff of 5.
>>
>> F1 results for a 96k-sentence corpus:
>> Default CG: 0.9853360692658026
>> Default CG + Abb: 0.9854463195403679 (+0.0001)
>>
>> Custom CG: 0.9911605417797043
>> Custom CG + Abb: 0.9911809163438341 (+0.00002)
>>
>> To create the custom context generator I added some features that I took
>> from the Tokenizer.
>>
>> The numbers indicate that the abbreviation dictionary barely increased F1.
>> But when trying the model I noticed that it does in fact perform better
>> when handling abbreviations. I noticed the same when running the cross
>> validator with the option "-misclassified true".
>>
>> My feeling is that there are far more trivial cases, and the special cases
>> affected by the abbreviation dictionary are so few that they barely affect
>> the F1.
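A minimal sketch of why this happens, assuming boundary-level counts (a simplification of how a sentence detector is actually scored). All counts below are hypothetical and chosen only for illustration; they come from neither corpus, and `f1` is an illustrative helper, not an OpenNLP function.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical baseline: 100,000 true boundaries, 1,000 errors of each kind.
baseline = f1(tp=99_000, fp=1_000, fn=1_000)

# Hypothetical effect of a dictionary: 50 false boundaries after
# abbreviations disappear, everything else stays the same.
with_dict = f1(tp=99_000, fp=950, fn=1_000)

print(f"baseline F1:  {baseline:.5f}")
print(f"with dict F1: {with_dict:.5f}")
print(f"delta:        {with_dict - baseline:+.5f}")
```

Under these assumed counts the 50 corrected cases are a tiny share of all boundaries, so the score changes only in the fourth decimal place even though every one of those cases is a visible improvement.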
>>
>> I also tried with a 4k-sentence corpus. F1 values:
>>
>> Custom CG: 0.9566960705693666
>> Custom CG + Abb: 0.958779443254818 (+0.002)
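The gains quoted in this message can be recomputed from the reported F1 values. The sketch below does that and also reports how much of the remaining gap to a perfect score (1 - F1) each gain closes. That second figure is an illustrative metric assumed here for comparison, not something reported in the thread, and `summarize` is a hypothetical helper name.

```python
# F1 values copied from the message: (without dictionary, with dictionary)
reported = {
    "96k, Default CG": (0.9853360692658026, 0.9854463195403679),
    "96k, Custom CG": (0.9911605417797043, 0.9911809163438341),
    "4k, Custom CG": (0.9566960705693666, 0.958779443254818),
}


def summarize(without: float, with_abb: float) -> tuple[float, float]:
    """Return (absolute F1 gain, share of the 1 - F1 gap that was closed)."""
    gain = with_abb - without
    return gain, gain / (1.0 - without)


for name, (without, with_abb) in reported.items():
    gain, share = summarize(without, with_abb)
    print(f"{name}: gain {gain:+.5f}, gap closed {share:.2%}")
```

On these numbers the 4k-sentence corpus shows the largest gain in both absolute and relative terms, which matches the +0.002 reported above.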
>>
>> William
>>
>> On Wed, Feb 15, 2012 at 1:37 PM, Katrin Tomanek
>> <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am trying to optimize my sentence detector model by adding an
>>> abbreviation dictionary.
>>>
>>> Can anybody give some hints on best practices for which abbreviations to
>>> add here? E.g., only very frequent ones? Problematic ones? Any?
>>>
>>> I just experimented with a very big abbreviation dictionary and found
>>> that, in German medical patient records, this rather decreases
>>> performance.
>>>
>>> Any experiences where abbreviation dictionaries improved performance?
>>>
>>>
>>> Best
>>> Katrin
>>>
>>
>
>
> --
> Dr. Katrin Tomanek
> Averbis GmbH
> Tennenbacher Strasse 11
> D-79106 Freiburg
>
> Fon: +49 (0) 761 - 203 97696
> Fax: +49 (0) 761 - 203 97694
> E-Mail: [email protected]
>
> Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
> Sitz der Gesellschaft: Freiburg i. Br.
> AG Freiburg i. Br., HRB 701080