Peter,
The tokenizer is really there to learn how to split things that are not
already separated by spaces. You can have normal lines without <SPLIT> in
the file, but the tokenizer needs at least some <SPLIT> lines to train on,
e.g.: This can<SPLIT>'t be a good thing<SPLIT>.
Once trained, the tokenizer then properly splits sentences into complete
tokens, like this:
It was n't long before someone opened the book .
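If it helps, here is a minimal sketch of doing the same training through
the Java API instead of the CLI. It assumes the 1.5-era API
(PlainTextByLineStream with an InputStream, the three-argument
TokenizerME.train), and the file names en-token.train and en-token.bin are
just placeholders:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TokenizerTrainingSketch {

    public static void main(String[] args) throws Exception {
        // Training data: one sentence per line, tokens separated by
        // whitespace or by <SPLIT>. At least some lines should contain
        // <SPLIT>, otherwise training produces a model that fails the
        // "not compatible with the tokenizer" check quoted below.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("en-token.train"), "UTF-8");
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // true = use the alphanumeric optimization
        TokenizerModel model = TokenizerME.train("en", samples, true);

        OutputStream out = new FileOutputStream("en-token.bin");
        try {
            model.serialize(out);
        } finally {
            out.close();
        }
    }
}

The TokenizerTrainer CLI tool ends up in the same TokenizerME.train call
(as your stack trace shows), so the same training-data requirement applies
either way.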
On 1/6/2014 6:00 PM, Peter Thygesen wrote:
I just observed that training a token model on a text file without any
<SPLIT> tags fails with the following error message:
Performing 100 iterations.
1: ... loglikelihood=0.0 1.0
2: ... loglikelihood=0.0 1.0
Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:476)
    at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:63)
    at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:253)
    at opennlp.tools.cmdline.tokenizer.TokenizerTrainerTool.run(TokenizerTrainerTool.java:89)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)
Caused by: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.tokenize.TokenizerModel.validateArtifactMap(TokenizerModel.java:155)
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:474)
    ... 4 more
When I added just one (!) <SPLIT> tag, it worked. But the documentation states:
"... The OpenNLP format contains one sentence per line. Tokens are either
separated by a *whitespace* *or* by a special <SPLIT> tag. The following
sample shows the sample from above in the correct format."
I understand that to mean the file does not necessarily have to contain
<SPLIT> tags, just whitespace.
brgds,
Peter Thygesen