On 1/8/2013 10:42 AM, Adi wrote:
James Kosin <james.kosin@...> writes:


How many sentences do you have in the training set used to train your model?
What parameters did you use?
Do the sentences include a mix of sentences with and without
abbreviations?

James


Hi James,

I had a training corpus of around 1200 sentences.
Some of these sentences had abbreviations, but I'm trying to get the
detector to perform better on unseen abbreviations.

That is why I'm trying the abbreviations dictionary.

I did not supply any training parameters; the documentation on exactly what
to supply is a little unclear. Could this be the issue?
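
For reference, building and saving such a dictionary in code might look like the sketch below. This is only a sketch, assuming the OpenNLP 1.5 Dictionary API; the entries and the abbrev.xml file name are placeholders.

--
import java.io.FileOutputStream;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.util.StringList;

public class BuildAbbrevDict {
    public static void main(String[] args) throws Exception {
        // Case-sensitive dictionary; each entry is one abbreviation token.
        Dictionary abbrevs = new Dictionary(true);
        abbrevs.put(new StringList("Dr."));
        abbrevs.put(new StringList("Mr."));
        abbrevs.put(new StringList("e.g."));

        // Write the dictionary out in OpenNLP's XML dictionary format.
        FileOutputStream out = new FileOutputStream("abbrev.xml");
        abbrevs.serialize(out);
        out.close();
    }
}
--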


With the training, you want to specify the -iterations [number] parameter to train over the data more times; for a set of that size, a number like 500 may work well.

The other option that will make a difference is the -cutoff [number] parameter. The default is 5, meaning a pattern must be seen at least 5 times in the training data before it is accepted; setting it to 0 means every occurrence counts and nothing is filtered out. You may want to experiment with this number more than the one above.

The last major one is the -encoding [charset] parameter; be sure to set this to match the encoding of the input data. If it is wrong, you can end up with unpredictable results.
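
For comparison, here is roughly what those same settings look like when training through the Java API instead of the command line. This is a sketch only; it assumes the OpenNLP 1.5 SentenceDetectorME.train overload that accepts an abbreviations dictionary, and train.txt, abbrev.xml, and en-sent.bin are placeholder file names.

--
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSentDetect {
    public static void main(String[] args) throws Exception {
        // -encoding: read the training sentences with the correct charset.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new InputStreamReader(new FileInputStream("train.txt"), "UTF-8"));
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        // -iterations and -cutoff, as discussed above.
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, "500");
        params.put(TrainingParameters.CUTOFF_PARAM, "0");

        // Optional abbreviations dictionary (see the earlier sketch).
        Dictionary abbrevs = new Dictionary(new FileInputStream("abbrev.xml"));

        SentenceModel model = SentenceDetectorME.train("en", samples, true, abbrevs, params);

        OutputStream modelOut = new FileOutputStream("en-sent.bin");
        model.serialize(modelOut);
        modelOut.close();
    }
}
--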

The output will look similar to this when done:
--

Performing 100 iterations.
  1:  ... loglikelihood=-222627.18862582813     0.962500740214366
  2:  ... loglikelihood=-32332.351994752924     0.9630139555081818
  3:  ... loglikelihood=-20417.068419466785     0.9669617654606107
  4:  ... loglikelihood=-15645.761778953061     0.972533112255976
--

The key is to get the log-likelihood as close to 0 and the last number (the accuracy on the training events) as close to 1 as possible. They don't need to reach those values exactly; they just need to get close. The further they are from those numbers, the less reliable the model can be.

James
