Re: Abbreviations for Tokenisation Training

Jörn Kottmann Thu, 14 Mar 2013 10:29:53 -0700

The abbreviation list has almost no impact on the accuracy of the tokenizer.
It might help if you have data with very rare abbreviations, but it's not a
feature you should use when you are just getting started with training.

My recommendation is to first get a good baseline tokenizer model, and then,
if you are not happy with it, experiment with more advanced features or
customization.

I don't know how the dots are handled in the lookup code. Maybe somebody else
here does; otherwise I can have a look at the code.

Jörn
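
Until someone confirms how the lookup code treats the trailing dot, one possible workaround is to put both forms of each abbreviation into the list. The sketch below only illustrates that idea: the class and method names are hypothetical and not part of OpenNLP, and it is an untested assumption that listing both variants is harmless for the trainer.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AbbreviationVariants {

    // Hypothetical helper (not an OpenNLP API): for every abbreviation,
    // return both the bare form ("Dr") and the dotted form ("Dr.").
    static List<String> withAndWithoutDot(List<String> abbreviations) {
        // LinkedHashSet keeps insertion order and drops duplicates.
        Set<String> variants = new LinkedHashSet<>();
        for (String abbreviation : abbreviations) {
            String trimmed = abbreviation.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Strip a single trailing dot, if present, to get the bare form.
            String bare = trimmed.endsWith(".")
                    ? trimmed.substring(0, trimmed.length() - 1)
                    : trimmed;
            if (bare.isEmpty()) {
                continue; // the entry was just "."
            }
            variants.add(bare);
            variants.add(bare + ".");
        }
        return new ArrayList<>(variants);
    }

    public static void main(String[] args) {
        // Print one entry per line, e.g. to paste into an abbreviation list.
        for (String entry : withAndWithoutDot(List.of("e.g.", "usw", "Dr."))) {
            System.out.println(entry);
        }
    }
}
```

Whether the redundant entries are actually needed depends on the lookup code, so this is only a stopgap until that is checked.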

On 03/14/2013 05:24 PM, Andreas Niekler wrote:

Dear List,

Do the abbreviations for the token trainer include the trailing "." or do
they come as just the bare string,

like

e.g. vs. e.g

or

usw. vs. usw

or

Dr. vs. Dr

Thank you

Andreas

On 14.03.2013 14:50, Jörn Kottmann wrote:

On 03/14/2013 02:15 PM, Andreas Niekler wrote:

Hello,

It seems that this issue has already been opened by you:
https://issues.apache.org/jira/browse/OPENNLP-501

Should I include that in 1.6.0 or just the trunk?

Leave the version open. It would probably be nice to pull that
fix into 1.5.3, but that depends on how quickly we get it and what
the other committers think about it, so I can't promise anything here.
If it does not go into 1.5.3, it will definitely go into the version after.

Jörn
