The abbreviation list has almost no impact on the accuracy of the tokenizer.
It might help if your data contains very rare abbreviations, but it's not a feature
you should use when you are just getting started with training.

My recommendation is to first get a good baseline tokenizer model, and then, if you are not happy with it, experiment with more advanced features or customization.

I don't know how the dots are handled in the lookup code. Maybe somebody else here does;
otherwise I can have a look at the code.

Jörn

On 03/14/2013 05:24 PM, Andreas Niekler wrote:
Dear List,

do the abbreviations for the tokenizer trainer include the trailing "." or do
they just come in the form of the bare string?

For example:

e.g. vs. e.g

or

usw. vs. usw

or

Dr. vs. Dr

Thank you

Andreas

On 14.03.2013 14:50, Jörn Kottmann wrote:
On 03/14/2013 02:15 PM, Andreas Niekler wrote:
Hello,

it seems that you have already opened this issue:
https://issues.apache.org/jira/browse/OPENNLP-501

Should I include that in 1.6.0 or just the trunk?
Leave the version open. It would probably be nice to pull that
fix into 1.5.3, but it depends on how quickly we get it and what
the other committers think about it, so I can't promise anything here.
If it does not go into 1.5.3, it will definitely go into the version after.

Jörn

