On 1/8/2013 10:42 AM, Adi wrote:
James Kosin <james.kosin@...> writes:
How many sentences do you have in the training set used to train your model?
What parameters did you use?
Do the sentences include a mix of sentences with and without
abbreviations?
James
Hi James,
I had a training corpus of around 1200 sentences.
Some of these sentences contained abbreviations, but I'm trying to get it
to perform better on unseen abbreviations, which is why I'm trying the
abbreviations dictionary.
I did not supply any training parameters; the documentation on exactly what
to supply is a little unclear. Could this be the issue?
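For reference, an OpenNLP abbreviation dictionary is a plain XML file of
token entries. A minimal sketch might look like the following (the entries
here are just placeholders; check the dictionary format documentation for
your OpenNLP version):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">
  <entry>
    <token>Dr.</token>
  </entry>
  <entry>
    <token>etc.</token>
  </entry>
</dictionary>
```

If your version's SentenceDetectorTrainer supports it, the file is typically
passed with the -abbDict option; run the tool with no arguments to see the
exact flags it accepts.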
With the training, you want to specify the -iterations [number] parameter to
train over the data more times; for a set that large, a number like 500
may work well.
The other option that will make a difference is the -cutoff [number]
parameter. The default is 5, meaning a pattern must be seen at least 5
times in the training data before it is used; setting it to 0 means every
occurrence counts and nothing is filtered out. You may want to tune this
number more than the one above.
The last major one is the -encoding [charset] parameter; be sure to set it
to match the encoding of the input data. If it is wrong, you can end up
with unpredictable results.
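Putting those together, a training invocation might look something like
this (file names are placeholders, and flag spellings can vary between
OpenNLP releases, so check your version's usage output):

```shell
# Train a sentence detector model on UTF-8 input,
# with more iterations and no feature cutoff.
opennlp SentenceDetectorTrainer \
    -lang en \
    -encoding UTF-8 \
    -iterations 500 \
    -cutoff 0 \
    -data train-sentences.txt \
    -model en-sent.bin
```

The training data is expected to contain one sentence per line, which is
how the trainer learns where sentence boundaries fall.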
The output will look similar to this when done:
--
Performing 100 iterations.
1: ... loglikelihood=-222627.18862582813 0.962500740214366
2: ... loglikelihood=-32332.351994752924 0.9630139555081818
3: ... loglikelihood=-20417.068419466785 0.9669617654606107
4: ... loglikelihood=-15645.761778953061 0.972533112255976
--
The key is to get the log likelihood as close to 0 as possible (it will
always be negative) and the last number as close to 1 as possible. They
don't need to reach those values exactly; they just need to get close. The
further away they are, the less reliable the model may be.
James