James Kosin <james.kosin@...> writes:
> How many sentences do you have in the training set used to train your model?
> What parameters did you use?
> Do the sentences have a variation of sentences with and without
> abbreviations?
>
> James
>
Hi James,
I had a training corpus of around 1200 sentences.
Some of these sentences had abbreviations, but I'm trying to get it to perform
better on unseen abbreviations, which is why I'm trying the abbreviations
dictionary.
I did not supply any training parameters, as the documentation for what exactly
to supply is a little unclear. Could this be the issue?
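As far as I can tell, the parameters can also be given as a properties-style
file. Below is a rough sketch of writing such a file with plain
java.util.Properties. The key names (Algorithm, Iterations, Cutoff) and the
values are my assumption of what the defaults are, and the class name
TrainingParamsFile is made up for illustration; it is not part of OpenNLP and
not something I have run against my data.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class TrainingParamsFile {

    // Key names and values are assumptions (believed to match the defaults).
    static Properties defaults() {
        Properties p = new Properties();
        p.setProperty("Algorithm", "MAXENT");
        p.setProperty("Iterations", "100");
        p.setProperty("Cutoff", "5");
        return p;
    }

    // Render the parameters as key=value lines in java.util.Properties format.
    static String render(Properties p) {
        StringWriter out = new StringWriter();
        try {
            p.store(out, "sentence detector training parameters");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(defaults()));
    }
}
```

If I read the API correctly, a file like this could then be passed to the
command-line trainer with -params, or loaded in code through the
TrainingParameters(InputStream) constructor.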
________________________________________________________________________________
// Load the abbreviation dictionary from its XML file.
Dictionary abbrDict = new Dictionary(new FileInputStream(new File(pathToAbbr)));
// Read the training data line by line.
ObjectStream<String> lineStream =
    new PlainTextByLineStream(new FileInputStream(pathToData), "UTF-8");
ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
// useTokenEnd = true; null for the end-of-sentence characters.
SentenceDetectorFactory sdfac =
    new SentenceDetectorFactory("en", true, abbrDict, null);
// Empty parameters: no algorithm, iterations or cutoff set explicitly.
TrainingParameters trainParams = new TrainingParameters();
// model is a SentenceModel declared elsewhere.
model = SentenceDetectorME.train("en", sampleStream, sdfac, trainParams);
_______________________________________________________________________________
This is the format of the abbreviations XML file I'm using; I got it from one
of the forums as well.
<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">
  <entry>
    <token>tel.</token>
  </entry>
  <entry>
    <token>Jr.</token>
  </entry>
  <entry>
    <token>Mrs.</token>
  </entry>
</dictionary>
_______________________________________________________________________________
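In case it helps anyone with a longer list, here is a rough sketch of how a
file in the format above could be generated from a plain list of abbreviations
using only the standard library. The class and method names
(AbbreviationDictionaryXml, build, escape) are made up for illustration and are
not part of OpenNLP.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class AbbreviationDictionaryXml {

    // Escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build one <entry><token>...</token></entry> element per abbreviation.
    static String build(List<String> abbreviations, boolean caseSensitive) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<dictionary case_sensitive=\"").append(caseSensitive).append("\">\n");
        for (String abbr : abbreviations) {
            sb.append("<entry>\n");
            sb.append("<token>").append(escape(abbr)).append("</token>\n");
            sb.append("</entry>\n");
        }
        sb.append("</dictionary>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints a dictionary with the same three entries as the file above.
        System.out.print(build(Arrays.asList("tel.", "Jr.", "Mrs."), false));
    }
}
```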
Thanks again for your assistance.
Adi