Hi, Svetoslav,
On Fri, Jan 27, 2012 at 1:04 PM, Svetoslav Marinov <[email protected]> wrote:

> Hi all,
>
> I have asked this question earlier in another thread but did not get
> answer.
>
> I would like to train new models for Swedish for sentence detection,
> tokenization, POStagging, NER since the existing models seem to perform
> poorly on my data.
>
> I read the documentation and I surely followed the steps for training a
> model from command line or the API, however, I am still not very happy
> with the results. The existing Swedish models are trained on a small
> Swedish corpus (Talbanken), while I have access to a much larger training
> set (SUC corpus) .
>
> Here are the problems I face:
>
> 1. current Swedish model fails when sentences start with lower-case
> letters. Also when there is no space between the full-stop and the next
> sentence, the model splits the sentence but keeps the first word of the
> second sentence as part of the first sentence. My current model cannot
> split at all if there is no space after the full stop.
>

Your production data should have the same characteristics as your training data. The model will only handle cases such as sentences that start with lower-case letters, or a missing space between the full stop and the next sentence, if the training data covers them. You can add sentences with these characteristics to your training data.

>
> Questions: Can one influence the training set by creating examples where
> there is no space between two sentences? Can one add features to the model?
> If so how? What about if the training data contains sentences without end
> of sentence markers (fullstops, etc.)? Should such exist in the training
> set? Is there some example/documentation about it?
> At least I cannot seem to find it – but correct me if I am wrong.
>

Yes, simply append such examples to your training data.
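To illustrate what appending such examples could look like, here is a minimal sketch in plain Java that takes existing training sentences (one per line, as in the documented format) and adds a copy of each with a lower-case first letter. The class and method names are made up for this example and are not part of OpenNLP. I am not sure the one-sentence-per-line format can express the "no space after the full stop" case, so that case is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: derives extra training
// sentences from the ones you already have.
class SentenceTrainingAugmenter {

    // Returns the sentence with its first character lower-cased.
    static String lowerCaseFirst(String sentence) {
        if (sentence.isEmpty()) {
            return sentence;
        }
        return Character.toLowerCase(sentence.charAt(0)) + sentence.substring(1);
    }

    // Returns the original sentences followed by the lower-case-initial
    // variants; sentences that already start in lower case add nothing.
    static List<String> augment(List<String> sentences) {
        List<String> result = new ArrayList<>(sentences);
        for (String sentence : sentences) {
            String variant = lowerCaseFirst(sentence);
            if (!variant.equals(sentence)) {
                result.add(variant);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up example sentences; in practice these would come from the corpus.
        List<String> training = Arrays.asList("Det regnar idag.", "Vi stannar hemma.");
        for (String line : augment(training)) {
            System.out.println(line); // one sentence per line
        }
    }
}
```

The printed lines could then be appended to the sentence detector training file before retraining.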
I don't think the documentation covers how to create a training corpus; it only explains the format. Any help to improve it is highly appreciated.

> 2. The NER task suffers from similar problems. My current model has a hard
> time recognizing single names but does OK is the name consists of several
> words. However, I would like to include POS tag information as part of the
> training features. Is this possible? If so how? Any examples/documentation
> about it?
>

Does your corpus include examples of single names? Are they easy to distinguish from other tokens? Maybe you should consider using Custom Feature Generation <http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen> to add a "dictionary" element. You can build the dictionary with the DictionaryBuilder tool, "bin/opennlp DictionaryBuilder".

I think you can add POS tag information, but I don't know exactly how to do it. I would investigate whether it is possible using the "custom" element of Custom Feature Generation <http://incubator.apache.org/opennlp/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training.featuregen>, where you can pass a class that implements "AdaptiveFeatureGenerator".

Another possible way, maybe even simpler, is to use the additionalContext argument of the NameFinder to pass the POS tag info both while training and while executing the name finder. It should work, but I have never tried it.

Regards,
William
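A minimal sketch of the additionalContext idea mentioned above, in plain Java and untested against OpenNLP itself. It only builds the per-token array of extra features from POS tags. The class name, the method name, and the "pos=" prefix are made up for illustration. If I remember the API correctly, the name finder has a find(String[] tokens, String[][] additionalContext) variant that such an array could be passed to, but please verify that against your version.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: turns one POS tag per token
// into the String[][] shape used for additional context (one feature
// array per token).
class PosContextBuilder {

    static String[][] fromPosTags(String[] posTags) {
        String[][] context = new String[posTags.length][];
        for (int i = 0; i < posTags.length; i++) {
            // The "pos=" prefix is an arbitrary choice to keep these
            // features distinct from the token-based ones.
            context[i] = new String[] { "pos=" + posTags[i] };
        }
        return context;
    }

    public static void main(String[] args) {
        // Made-up example; the tags would come from your POS tagger.
        String[] tokens = { "Anna", "bor", "i", "Lund" };
        String[] posTags = { "PM", "VB", "PP", "PM" };

        String[][] additionalContext = fromPosTags(posTags);

        // Assumption: the array would be handed to the name finder together
        // with the tokens, for example nameFinder.find(tokens, additionalContext),
        // and the same kind of array would be needed on the training samples.
        System.out.println(Arrays.toString(tokens));
        System.out.println(Arrays.deepToString(additionalContext));
    }
}
```

The same features have to be supplied at training time and at run time, otherwise the model never sees them during training.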
