Hi all,

I asked this question earlier in another thread but did not get an answer.

I would like to train new models for Swedish for sentence detection, 
tokenization, POS tagging, and NER, since the existing models seem to perform 
poorly on my data.

I read the documentation and followed the steps for training a model from the 
command line and via the API; however, I am still not happy with the results. 
The existing Swedish models are trained on a small Swedish corpus (Talbanken), 
while I have access to a much larger training set (the SUC corpus).

Here are the problems I face:

1. The current Swedish model fails when sentences start with lower-case 
letters. Also, when there is no space between the full stop and the next 
sentence, the model splits the sentence but keeps the first word of the second 
sentence as part of the first. My own model cannot split at all if there is no 
space after the full stop.

Questions: Can one influence the training set by creating examples where there 
is no space between two sentences? Can one add features to the model? If so, 
how? What if the training data contains sentences without end-of-sentence 
markers (full stops, etc.)? Should such sentences exist in the training set? 
Is there any example/documentation about this? At least I cannot seem to find 
it, but correct me if I am wrong.
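To make the first question concrete, here is a minimal Python sketch of what I mean by "creating examples where there is no space between two sentences": building training samples as raw text plus character spans, once with and once without a space after the full stop. The function and the span-based representation are my own illustration, not necessarily the trainer's native input format:

```python
def make_sample(sentences, separator=" "):
    """Concatenate sentences into one text and record each sentence's
    character span (start, end).

    With separator="" the sentences are glued together with no space
    after the full stop, which is exactly the case my model fails on.
    """
    text, spans, pos = "", [], 0
    for i, sentence in enumerate(sentences):
        if i > 0:
            text += separator
            pos += len(separator)
        spans.append((pos, pos + len(sentence)))
        text += sentence
        pos += len(sentence)
    return text, spans

sents = ["Hon kom hem.", "Sedan somnade hon."]
# Normal case: a space separates the two sentences.
spaced_text, spaced_spans = make_sample(sents, separator=" ")
# Hard case: no space after the full stop.
glued_text, glued_spans = make_sample(sents, separator="")
```

If such glued samples are legal training input, I could generate them automatically from my corpus; that is what I would like to confirm.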

2. The NER task suffers from similar problems. My current model has a hard 
time recognizing single-word names but does OK if the name consists of several 
words. I would like to include POS tag information as part of the training 
features. Is this possible? If so, how? Any examples/documentation about it?

Thanks in advance!

Best,
Svetoslav
