On 1/8/2013 10:42 AM, Adi wrote:
James Kosin <james.kosin@...> writes:
How many sentences do you have in the training set used to train your model?
What parameters did you use?
Do the sentences include a mix of sentences with and without
abbreviations?
James
Hi James,
I had a training corpus of around 1200 sentences.
Some of these sentences contained abbreviations, but I'm trying to get it
to perform better on unseen abbreviations, which is why I'm trying the
abbreviations dictionary.
I did not supply any training parameters; the documentation on exactly what
to supply is a little unclear. Could this be the issue?
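For reference, an OpenNLP abbreviation dictionary is a plain XML file of
token entries. A minimal sketch might look like the following (the entries
here are just placeholders; check the dictionary format documentation for
your OpenNLP version):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">
  <entry>
    <token>Dr.</token>
  </entry>
  <entry>
    <token>etc.</token>
  </entry>
</dictionary>
```

If your version's SentenceDetectorTrainer supports it, the file is typically
passed with the -abbDict option; run the tool with no arguments to see the
exact flags it accepts.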
With the training, you want to specify the -iterations [number] parameter to
train over the data more times; for a set that large, a number like 500
may work well.
The other option that will make a difference is the -cutoff [number]
parameter. The default is 5, meaning a pattern must be seen at least 5
times in the training data before it is used; setting it to 0 means every
occurrence counts and nothing is filtered out. You may want to tune this
number more than the one above.
The last major one is the -encoding [charset] parameter; be sure to set it
to match the encoding of the input data. If it is wrong, you can end up
with unpredictable results.
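Putting those together, a training invocation might look something like
this (file names are placeholders, and flag spellings can vary between
OpenNLP releases, so check your version's usage output):

```shell
# Train a sentence detector model on UTF-8 input,
# with more iterations and no feature cutoff.
opennlp SentenceDetectorTrainer \
    -lang en \
    -encoding UTF-8 \
    -iterations 500 \
    -cutoff 0 \
    -data train-sentences.txt \
    -model en-sent.bin
```

The training data is expected to contain one sentence per line, which is
how the trainer learns where sentence boundaries fall.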
The output will look similar to this when done:
--
Performing 100 iterations.
1: ... loglikelihood=-222627.18862582813 0.962500740214366
2: ... loglikelihood=-32332.351994752924 0.9630139555081818
3: ... loglikelihood=-20417.068419466785 0.9669617654606107
4: ... loglikelihood=-15645.761778953061 0.972533112255976
--
The key is to get the log likelihood as close to 0 as possible (it will
always be negative) and the last number as close to 1 as possible. They
don't need to reach those values exactly; they just need to get close. The
further away they are, the less reliable the model may be.
James