On 01/07/2014 12:00 AM, Peter Thygesen wrote:
Just observed that training a token model with a text file without any
<SPLIT> tags will fail with the following error message:

Performing 100 iterations.

   1:  ... loglikelihood=0.0 1.0
   2:  ... loglikelihood=0.0 1.0

Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:476)
    at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:63)
    at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:253)
    at opennlp.tools.cmdline.tokenizer.TokenizerTrainerTool.run(TokenizerTrainerTool.java:89)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)
Caused by: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.tokenize.TokenizerModel.validateArtifactMap(TokenizerModel.java:155)
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:474)
    ... 4 more

When I added just one (!) <SPLIT> tag it worked. But the documentation states:
"... The OpenNLP format contains one sentence per line. Tokens are either
separated by whitespace or by a special <SPLIT> tag. The following
sample shows the sample from above in the correct format."

I understand that to mean the data does not necessarily have to contain
<SPLIT> tags, just whitespace-separated tokens.

Yes, that is correct: you need at least one, or a couple (depending on the settings), of <SPLIT> tags to train the tokenizer without getting an exception. If there are no <SPLIT> tags at all, a classification model with just one outcome is trained, and that model is then rejected as invalid with the exception you saw.
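For illustration (this sample is mine, along the lines of the one in the documentation, not taken from your data), a training line like

    Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board<SPLIT>.

gives the trainer both "split" and "no split" outcomes, while a line that only contains whitespace-separated tokens produces nothing but "no split" outcomes, which is exactly the one-outcome model that the check rejects.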

To train a tokenizer model you need the training data to be in the correct format, and you need a certain amount of it to produce a model which actually works.
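If it helps, here is a minimal sketch of doing the training through the Java API, roughly following the 1.5.x documentation; the file names and language code are placeholders:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.Charset;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainTokenizer {

      public static void main(String[] args) throws Exception {
        // Training data: one sentence per line, tokens separated by
        // whitespace or by <SPLIT> tags.
        ObjectStream<String> lines = new PlainTextByLineStream(
            new FileInputStream("en-token.train"), Charset.forName("UTF-8"));
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        TokenizerModel model;
        try {
          // This is where training fails with the exception above
          // when the data contains no <SPLIT> tags at all.
          model = TokenizerME.train("en", samples, true);
        } finally {
          samples.close();
        }

        OutputStream out = new FileOutputStream("en-token.bin");
        try {
          model.serialize(out);
        } finally {
          out.close();
        }
      }
    }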

Maybe we should add some feedback to the command line training tool which informs the user when training a model is not possible, and, when there is data, what kind of performance is to be expected, e.g. for fewer than 200 tokens: "The model will probably not work! Try to train on more data."
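Something along these lines, purely as a hypothetical sketch (the helper, the threshold and the counts are made up, this is not existing OpenNLP code):

    public class TrainingDataCheck {

      // Hypothetical feedback for the command line training tool:
      // warn before training when the data cannot produce a usable model.
      public static void warn(int tokenCount, int splitTagCount) {
        if (splitTagCount == 0) {
          System.err.println(
              "Training data contains no <SPLIT> tags, training is not possible.");
        } else if (tokenCount < 200) {
          System.err.println(
              "The model will probably not work! Try to train on more data.");
        }
      }
    }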

Cheers,
Jörn

