Hi Katrin,

I found an issue that can have a significant negative impact when an abbreviation dictionary is used. If you try again with the code from trunk, you may get better results: OPENNLP-479 <https://issues.apache.org/jira/browse/OPENNLP-479>
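In case it helps, here is a minimal sketch of training a sentence detector model with an abbreviation dictionary through the API. It assumes the 1.5.x API from trunk (the train overload that takes a Dictionary); the file names and the plain-text abbreviation list are just placeholders:

import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStream;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class AbbDictTrainer {

  public static void main(String[] args) throws Exception {
    // Abbreviation dictionary: one abbreviation per line, e.g. "Dr.", "approx."
    Dictionary abbDict =
        Dictionary.parseOneEntryPerLine(new FileReader("abbreviations.txt"));

    // Training data: one sentence per line
    ObjectStream<SentenceSample> samples = new SentenceSampleStream(
        new PlainTextByLineStream(new FileReader("sentences.train")));

    // MAXENT, 100 iterations, cutoff 5, as in the experiments below
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
    params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));

    SentenceModel model =
        SentenceDetectorME.train("pt", samples, true, abbDict, params);
    samples.close();

    OutputStream out = new FileOutputStream("pt-sent-abb.bin");
    model.serialize(out);
    out.close();

    // Quick sanity check: "Dr." should not be treated as a sentence boundary
    SentenceDetectorME detector = new SentenceDetectorME(model);
    for (String s : detector.sentDetect("O Dr. Silva chegou. Ele saiu ao meio-dia.")) {
      System.out.println(s);
    }
  }
}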
Thank you,
William

On Wed, Feb 15, 2012 at 2:58 PM, Katrin Tomanek <[email protected]> wrote:
> Hi William,
>
> thanks for sharing your experiences.
>
> I did another test:
> * Default Context Generator
> * Corpus: Genia
> * Variant 1: no abbreviation dictionary
> * Variant 2: big abbreviation dictionary of ~1000 entries
> * Variant 3: small abbreviation dictionary of only common and well-known
> abbreviations (15 entries)
>
> Here is what I get:
> * Variant 1 (F: 0.9910290237467019)
> * Variant 2 (F: 0.9907676074914271)
> * Variant 3 (F: 0.9910290237467019)
>
> --> So for me, using an abbreviation dictionary does not help (at least not
> in evaluation).
>
> However, when my users start finding common problems with abbreviations, I
> might start feeding in an abbreviation dictionary that could handle those
> maybe rare, but annoying problems...
>
>
> Cheers
> Katrin
>
>
> On 02/15/2012 05:46 PM, [email protected] wrote:
>>
>> I performed a few experiments with two Portuguese corpora. All tests were
>> run with MAXENT, 100 iterations and a cutoff of 5.
>>
>> F1 results for a 96k-sentence corpus:
>> Default CG: 0.9853360692658026
>> Default CG + Abb: 0.9854463195403679 (+0.0001)
>>
>> Custom CG: 0.9911605417797043
>> Custom CG + Abb: 0.9911809163438341 (+0.00002)
>>
>> To create the custom context generator I added some features that I took
>> from the Tokenizer.
>>
>> The numbers indicate that the abbreviation dictionary barely increased F1.
>> But when trying the model I noticed that it does in fact perform better at
>> handling abbreviations. I noticed the same by running the cross validator
>> with the option "-misclassified true".
>>
>> The feeling I have about it is that there are far more trivial cases, and
>> the special cases that are affected by the abbreviation dictionary are so
>> few that they do not affect the F1.
>>
>> I also tried with a 4k-sentence corpus. F1 values:
>>
>> Custom CG: 0.9566960705693666
>> Custom CG + Abb: 0.958779443254818 (+0.002)
>>
>> William
>>
>> On Wed, Feb 15, 2012 at 1:37 PM, Katrin Tomanek
>> <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am trying to optimize my sentence detector model by adding an
>>> abbreviation dictionary.
>>>
>>> Can anybody give some hints on best practices for which abbreviations to
>>> add here? E.g., only very frequent ones? Problematic ones? Any?
>>>
>>> I just experimented with a very big abbreviation dictionary and found
>>> that, on German medical patient records, it rather decreases
>>> performance.
>>>
>>> Any experiences where abbreviation dictionaries improved performance?
>>>
>>>
>>> Best
>>> Katrin
>>>
>>
>
>
> --
> Dr. Katrin Tomanek
> Averbis GmbH
> Tennenbacher Strasse 11
> D-79106 Freiburg
>
> Phone: +49 (0) 761 - 203 97696
> Fax: +49 (0) 761 - 203 97694
> E-Mail: [email protected]
>
> Managing directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
> Registered office: Freiburg i. Br.
> AG Freiburg i. Br., HRB 701080
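For side-by-side comparisons like the F1 numbers quoted above, the evaluation can also be run programmatically against a held-out file. A minimal sketch, again assuming the 1.5.x API; the model and test file names are placeholders, with one model trained without and one with the abbreviation dictionary. The cross validator with "-misclassified true", as mentioned above, remains the easiest way to see which individual sentences actually change.

import java.io.FileInputStream;
import java.io.FileReader;

import opennlp.tools.sentdetect.SentenceDetectorEvaluator;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class CompareSentenceModels {

  // F1 of a serialized model on held-out data (one sentence per line).
  static double f1(String modelFile, String testFile) throws Exception {
    SentenceModel model = new SentenceModel(new FileInputStream(modelFile));
    SentenceDetectorEvaluator evaluator =
        new SentenceDetectorEvaluator(new SentenceDetectorME(model));

    ObjectStream<SentenceSample> samples = new SentenceSampleStream(
        new PlainTextByLineStream(new FileReader(testFile)));
    evaluator.evaluate(samples);
    samples.close();

    return evaluator.getFMeasure().getFMeasure();
  }

  public static void main(String[] args) throws Exception {
    // Placeholder model names: trained without and with the abbreviation dictionary
    System.out.println("without abbDict: " + f1("pt-sent.bin", "sentences.test"));
    System.out.println("with abbDict:    " + f1("pt-sent-abb.bin", "sentences.test"));
  }
}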
