On 03/14/2013 11:44 AM, Andreas Niekler wrote:
> What I don't understand is how this is producing valid training data,
> since I just delete whitespace. You said that I need to include some
> <SPLIT> tags to have proper training data. Can you please comment on
> why we have proper training data after detokenizing? I hope it's OK
> to ask all these questions, but I really want to understand OpenNLP
> tokenisation.
The training data needs to reflect the data you want to process.
In German (as in English) most tokens are already separated by white
space, but punctuation and word tokens may be written together without a
separating white space. To encode the latter case in the training data
we use the <SPLIT> tag.
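For example (a made-up German sentence, assuming the standard tokenizer
training format of one sentence per line), a training line could look
like this:

    Das<SPLIT>, was ich meine<SPLIT>, ist einfach<SPLIT>.

Deleting the <SPLIT> tags gives back the detokenized surface form
"Das, was ich meine, ist einfach." which is exactly the kind of raw
text the tokenizer will see at run time.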
If you just replace all white spaces with <SPLIT> tags in your
white-space-tokenized data, the input data probably no longer matches
the training data, because the training data would then claim that
tokens are never separated by white space. To make the input data match
again you would need to remove all white spaces from it.
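As a sketch of the usual fix (plain Java, not an OpenNLP API; the
helper name and the simple punctuation rule are just assumptions for
illustration): instead of replacing every white space, reattach only
the punctuation tokens with <SPLIT> and keep the ordinary spaces.

    // Sketch: turn one white-space-tokenized line into a tokenizer
    // training line. Only boundaries that carry no space in the raw
    // text get a <SPLIT> tag; here we approximate that by attaching
    // trailing punctuation to the preceding token.
    static String toTrainingLine(String whitespaceTokenizedLine) {
      StringBuilder sb = new StringBuilder();
      for (String token : whitespaceTokenizedLine.split(" ")) {
        if (sb.length() > 0) {
          if (token.matches("[.,;:!?)\\]]")) {
            sb.append("<SPLIT>"); // no space before this token
          } else {
            sb.append(' ');       // normal white-space boundary
          }
        }
        sb.append(token);
      }
      return sb.toString();
    }

For instance, toTrainingLine("Das , was ich meine , ist einfach .")
returns "Das<SPLIT>, was ich meine<SPLIT>, ist einfach<SPLIT>."
Opening brackets and quotes would need the opposite rule (attach to the
following token); a real converter, e.g. a dictionary-based
detokenizer, would handle those cases too.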
Can you give us more details about your training data? Is it
white-space tokenized?
Jörn