The abbreviation list has almost no impact on the accuracy of the tokenizer.
It might help if you have data with very rare abbreviations, but it's not a
feature you should use when you are just getting started with training.
My recommendation is to first get a good baseline tokenizer model, and then,
if you are not happy with it, experiment with more advanced features or
customization.
I don't know how the dots are handled in the lookup code. Maybe somebody else
here does; otherwise I can have a look at the code.
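Until that is confirmed, one way to sidestep the question is to list every
abbreviation in both spellings. The following is only a minimal sketch in
Python, assuming that extra entries in the list do no harm. The helper name
with_and_without_dot is hypothetical and not part of OpenNLP, and the sketch
does not write the dictionary file itself:

```python
def with_and_without_dot(abbreviations):
    """Return each abbreviation in both forms, e.g. "Dr." and "Dr".

    Hypothetical helper, not an OpenNLP API. It only prepares the
    entries; writing them to a dictionary file is left out.
    """
    variants = []
    seen = set()
    for abbreviation in abbreviations:
        stem = abbreviation.strip()
        # Drop a single trailing dot so "Dr." and "Dr" share one stem.
        if stem.endswith("."):
            stem = stem[:-1]
        if not stem:
            continue  # skip empty entries and a lone "."
        # Emit the dotted form first, then the bare form, without duplicates.
        for form in (stem + ".", stem):
            if form not in seen:
                seen.add(form)
                variants.append(form)
    return variants


print(with_and_without_dot(["e.g.", "usw", "Dr."]))
```

Only the final dot is touched, so inner dots such as the one in "e.g" are kept.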
Jörn
On 03/14/2013 05:24 PM, Andreas Niekler wrote:
Dear List,
do the abbreviations for the token trainer include the trailing "." or do
they just come in the form of the bare string,
like
e.g. vs. e.g
or
usw. vs. usw
or
Dr. vs. Dr
Thank you
Andreas
On 14.03.2013 14:50, Jörn Kottmann wrote:
On 03/14/2013 02:15 PM, Andreas Niekler wrote:
Hello,
It seems that this issue was already opened by you:
https://issues.apache.org/jira/browse/OPENNLP-501
Should I include that in 1.6.0 or just in the trunk?
Leave the version open. It would probably be nice to pull that
fix into 1.5.3, but it depends on how quickly we get it and what
the other committers think about it, so I can't promise anything here.
If it does not go into 1.5.3, it will definitely go into the version after.
Jörn