Thanks for the answer, William! I was pretty busy with other things, but I will soon try to find time to work through the suggestions you gave me. I hope they will improve the performance.
Best,
Svetoslav

On 2012-01-30 19:12, "[email protected]" <[email protected]> wrote:

>Hi, Svetoslav,
>
>On Fri, Jan 27, 2012 at 1:04 PM, Svetoslav Marinov
><[email protected]> wrote:
>
>> Hi all,
>>
>> I have asked this question earlier in another thread but did not get an
>> answer.
>>
>> I would like to train new models for Swedish for sentence detection,
>> tokenization, POS tagging and NER, since the existing models seem to
>> perform poorly on my data.
>>
>> I read the documentation and I certainly followed the steps for training
>> a model from the command line or the API; however, I am still not very
>> happy with the results. The existing Swedish models are trained on a
>> small Swedish corpus (Talbanken), while I have access to a much larger
>> training set (the SUC corpus).
>>
>> Here are the problems I face:
>>
>> 1. The current Swedish model fails when sentences start with lower-case
>> letters. Also, when there is no space between the full stop and the next
>> sentence, the model splits the sentence but keeps the first word of the
>> second sentence as part of the first sentence. My current model cannot
>> split at all if there is no space after the full stop.
>
>Your production data should have the same characteristics as your training
>data. The model will only handle cases like sentences that start with
>lower-case letters, or no space between the full stop and the next
>sentence, properly if these were covered by the training data. You can add
>sentences with these characteristics to your training data.
>
>> Questions: Can one influence the training set by creating examples where
>> there is no space between two sentences? Can one add features to the
>> model? If so, how? What if the training data contains sentences without
>> end-of-sentence markers (full stops, etc.)? Should such sentences exist
>> in the training set? Is there some example/documentation about it?
>> At least I cannot seem to find it, but correct me if I am wrong.
>
>Yes, simply append it to your training data. I don't think the
>documentation covers how to create a training corpus, it only explains the
>format, but any help to improve it is highly appreciated.
>
>> 2. The NER task suffers from similar problems. My current model has a
>> hard time recognizing single names but does OK if the name consists of
>> several words. However, I would like to include POS tag information as
>> part of the training features. Is this possible? If so, how? Any
>> examples/documentation about it?
>
>Does your corpus include examples of single names? Is it easy to
>distinguish them from other tokens? Maybe you should consider using Custom
>Feature Generation
><http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen>
>to add a "dictionary" element. You can use the DictionaryBuilder tool:
>"bin/opennlp DictionaryBuilder".
>
>I think you can add POS tag information, but I don't know exactly how to
>do it. I would investigate whether it is possible using the "custom"
>element of the Custom Feature Generation
><http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen>,
>where you can pass a class that implements "AdaptiveFeatureGenerator".
>
>Another possible way, maybe even simpler, is to use the additionalContext
>argument of the NameFinder to pass the POS tag info while training and
>running the name finder. It should work, but I have never tried it.
>
>Regards,
>William
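
To make William's first suggestion a bit more concrete: with the 1.5.x Java API you can build SentenceSample objects yourself, so examples with no space after the full stop, or with a lower-case sentence start, are just ordinary samples with the right character spans. This is only a rough, untested sketch; the two toy Swedish sentences, the output file name, and the exact train(...) overload are my assumptions, so please check them against the 1.5.2 javadoc before relying on it.

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;
import opennlp.tools.util.Span;

public class SvSentTrainSketch {

    public static void main(String[] args) throws Exception {

        // A SentenceSample is a piece of raw text plus the character spans of
        // its sentences. The second sample deliberately has no space after the
        // full stop and a lower-case sentence start, so the model sees exactly
        // the phenomena the production data contains.
        SentenceSample normal = new SentenceSample(
                "Det regnar idag. Vi stannar hemma.",
                new Span(0, 16), new Span(17, 34));

        SentenceSample noSpaceLowerCase = new SentenceSample(
                "Det regnar idag.vi stannar hemma.",
                new Span(0, 16), new Span(16, 33));

        ObjectStream<SentenceSample> samples =
                ObjectStreamUtils.createObjectStream(normal, noSpaceLowerCase);

        // Train a Swedish sentence model. In a real run the stream would be
        // built from the full SUC-derived training material, not two toy
        // samples; the default cutoff needs far more data than this.
        SentenceModel model = SentenceDetectorME.train("sv", samples, true, null);

        OutputStream out = new FileOutputStream("sv-sent-custom.bin");
        try {
            model.serialize(out);
        } finally {
            out.close();
        }
    }
}
```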
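If you go the AdaptiveFeatureGenerator route that William mentions, the class itself can be quite small. The sketch below is untested and assumes you already have a Swedish POSTaggerME loaded; the class name, the identity-based sentence cache, and the "pos=" feature prefix are all my own choices, not anything prescribed by OpenNLP.

```java
import java.util.List;

import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

/**
 * Adds the POS tag of the current token as a feature for the name finder.
 * The sentence is tagged once and the tags are reused while the same token
 * array is being processed.
 */
public class PosTagFeatureGenerator implements AdaptiveFeatureGenerator {

    private final POSTaggerME tagger;

    // crude per-sentence cache keyed on the token array identity
    private String[] lastTokens;
    private String[] lastTags;

    public PosTagFeatureGenerator(POSTaggerME tagger) {
        this.tagger = tagger;
    }

    public void createFeatures(List<String> features, String[] tokens, int index,
            String[] previousOutcomes) {
        if (tokens != lastTokens) {
            lastTokens = tokens;
            lastTags = tagger.tag(tokens);
        }
        features.add("pos=" + lastTags[index]);
    }

    public void updateAdaptiveData(String[] tokens, String[] outcomes) {
        // nothing adaptive to update in this generator
    }

    public void clearAdaptiveData() {
        // no adaptive state kept
    }
}
```

As far as I understand the 1.5.x API, such a generator can either be referenced from the "custom" element of the feature-generator descriptor that William links to, or passed programmatically to the NameFinderME training and construction methods that accept an AdaptiveFeatureGenerator; in both cases it should be combined with the default generators rather than replace them.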
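For the additionalContext idea, the data shape is simply one extra String[] of features per token, produced however you like. Like William, I have not tried this end to end; the helper names and the "pos=" prefix below are mine, and the assumption is that the same kind of context is supplied both when building the training samples and when calling find(...).

```java
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.Span;

public class AdditionalContextSketch {

    /** One row of additional context per token; here just the POS tag. */
    static String[][] posContext(POSTaggerME tagger, String[] tokens) {
        String[] tags = tagger.tag(tokens);
        String[][] context = new String[tokens.length][];
        for (int i = 0; i < tokens.length; i++) {
            context[i] = new String[] { "pos=" + tags[i] };
        }
        return context;
    }

    /** A training sample that carries the POS tags as additional context. */
    static NameSample trainingSample(POSTaggerME tagger, String[] tokens,
            Span[] names, boolean clearAdaptiveData) {
        return new NameSample(tokens, names, posContext(tagger, tokens),
                clearAdaptiveData);
    }

    /** At runtime the same context is handed to the name finder. */
    static Span[] findWithPos(TokenNameFinderModel model, POSTaggerME tagger,
            String[] tokens) {
        NameFinderME finder = new NameFinderME(model);
        return finder.find(tokens, posContext(tagger, tokens));
    }
}
```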
