On 01/07/2014 12:00 AM, Peter Thygesen wrote:
Just observed that training a token model with a text file without any
<SPLIT> tags will fail with the following error message:
Performing 100 iterations.
1: ... loglikelihood=0.0 1.0
2: ... loglikelihood=0.0 1.0
Exception in thread "main" java.lang.IllegalArgumentException:
opennlp.tools.util.InvalidFormatException: The maxent model is not
compatible with the tokenizer!
at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:476)
at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:63)
at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:253)
at opennlp.tools.cmdline.tokenizer.TokenizerTrainerTool.run(TokenizerTrainerTool.java:89)
at opennlp.tools.cmdline.CLI.main(CLI.java:222)
Caused by: opennlp.tools.util.InvalidFormatException: The maxent model is
not compatible with the tokenizer!
at opennlp.tools.tokenize.TokenizerModel.validateArtifactMap(TokenizerModel.java:155)
at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:474)
... 4 more
When I added just one <SPLIT> tag, it worked. But the documentation states:
"... The OpenNLP format contains one sentence per line. Tokens are either
separated by a *whitespace* *or* by a special <SPLIT> tag. The following
sample shows the sample from above in the correct format."
I understand that to mean the data does not necessarily have to contain
<SPLIT> tags, just whitespace.
Yes, that is correct: you need at least one or a couple (depending on
the settings) of <SPLIT> tags to train the tokenizer without getting an
exception. If there are no <SPLIT> tags, a classification model with
just one outcome is trained, and it is then rejected as invalid with an
exception.
To train a tokenizer model, you need the training data to be in the
correct format, and you need a certain quantity
to produce a model which actually works.
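For illustration, a minimal training file in that format might look like
the following (hypothetical sentences; tokens are separated by whitespace,
and a <SPLIT> tag marks a token boundary where the original text has no
whitespace, e.g. punctuation attached to a word, so that both outcomes
occur in the training data):

```
The red car stopped at the corner<SPLIT>.
It did not move again<SPLIT>, so we walked on .
```

With at least one <SPLIT> present the trained model has both a "split"
and a "no split" outcome, which avoids the exception above.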
Maybe we should add some feedback to the command line training tool
which informs the user when training a model is not possible, and, when
there is data, what kind of performance is to be expected, e.g. for
fewer than 200 tokens: "The model will probably not work! Try to train
on more data.".
Cheers,
Jörn