On 03/14/2013 11:44 AM, Andreas Niekler wrote:
> What I don't understand is how this is producing valid training data,
> since I just delete whitespace. You said that I need to include some
> <SPLIT> tags to have proper training data. Can you please comment on
> why we have proper training data after detokenizing? I hope it's OK
> to ask all these questions, but I really want to understand OpenNLP
> tokenisation.
The training data needs to reflect the data you want to process.
In German (as in English) most tokens are already separated by white
space, but punctuation and word tokens may be written together without a
separating white space. To encode the latter case in the training data
we use the <SPLIT> tag.
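For example (a made-up German sentence, assuming the standard tokenizer
training format of one sentence per line), a training line could look
like this:

    Das<SPLIT>, was ich meine<SPLIT>, ist einfach<SPLIT>.

Deleting the <SPLIT> tags gives back the detokenized surface form
"Das, was ich meine, ist einfach." which is exactly the kind of raw
text the tokenizer will see at run time.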
If you just replace all white spaces with <SPLIT> tags in your
white-space-tokenized data, the input data probably no longer matches
the training data, because the training data would then claim that
tokens are never separated by white space. To make the input data match
again you would need to remove all white spaces from it.
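As a sketch of the usual fix (plain Java, not an OpenNLP API; the
helper name and the simple punctuation rule are just assumptions for
illustration): instead of replacing every white space, reattach only
the punctuation tokens with <SPLIT> and keep the ordinary spaces.

    // Sketch: turn one white-space-tokenized line into a tokenizer
    // training line. Only boundaries that carry no space in the raw
    // text get a <SPLIT> tag; here we approximate that by attaching
    // trailing punctuation to the preceding token.
    static String toTrainingLine(String whitespaceTokenizedLine) {
      StringBuilder sb = new StringBuilder();
      for (String token : whitespaceTokenizedLine.split(" ")) {
        if (sb.length() > 0) {
          if (token.matches("[.,;:!?)\\]]")) {
            sb.append("<SPLIT>"); // no space before this token
          } else {
            sb.append(' ');       // normal white-space boundary
          }
        }
        sb.append(token);
      }
      return sb.toString();
    }

For instance, toTrainingLine("Das , was ich meine , ist einfach .")
returns "Das<SPLIT>, was ich meine<SPLIT>, ist einfach<SPLIT>."
Opening brackets and quotes would need the opposite rule (attach to the
following token); a real converter, e.g. a dictionary-based
detokenizer, would handle those cases too.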
Can you give us more details about your training data? Is it
white-space tokenized?
Jörn