On 01/07/2014 12:00 AM, Peter Thygesen wrote:
Just observed that training a token model with a text file without any
<SPLIT> tags will fail with the following error message:
Performing 100 iterations.
1: ... loglikelihood=0.0 1.0
2: ... loglikelihood=0.0 1.0
Exception in thread "main" java.lang.IllegalArgumentException:
opennlp.tools.util.InvalidFormatException: The maxent model is not
compatible with the tokenizer!
at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:476)
at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:63)
at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:253)
at opennlp.tools.cmdline.tokenizer.TokenizerTrainerTool.run(TokenizerTrainerTool.java:89)
at opennlp.tools.cmdline.CLI.main(CLI.java:222)
Caused by: opennlp.tools.util.InvalidFormatException: The maxent model is
not compatible with the tokenizer!
at opennlp.tools.tokenize.TokenizerModel.validateArtifactMap(TokenizerModel.java:155)
at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:474)
... 4 more
When I added just one <SPLIT> tag, it worked. But the documentation states:
"... The OpenNLP format contains one sentence per line. Tokens are either
separated by a *whitespace* *or* by a special <SPLIT> tag. The following
sample shows the sample from above in the correct format."
I understand that to mean the data does not necessarily have to contain
<SPLIT> tags, just whitespace.
Yes, that is correct: you need at least one or a couple (depending on
the settings) of <SPLIT> tags to train the tokenizer without getting an
exception. If there are no <SPLIT> tags, a classification model with
just one outcome is trained, and it is then rejected as invalid with an
exception.
To train a tokenizer model, you need the training data to be in the
correct format, and you need a certain quantity
to produce a model which actually works.
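For illustration, a minimal training file in that format might look like
the following (hypothetical sentences; tokens are separated by whitespace,
and a <SPLIT> tag marks a token boundary where the original text has no
whitespace, e.g. punctuation attached to a word, so that both outcomes
occur in the training data):

```
The red car stopped at the corner<SPLIT>.
It did not move again<SPLIT>, so we walked on .
```

With at least one <SPLIT> present the trained model has both a "split"
and a "no split" outcome, which avoids the exception above.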
Maybe we should add some feedback to the command line training tool
which informs the user when training a model is not possible, and, when
there is data, what kind of performance is to be expected, e.g. for
fewer than 200 tokens: "The model will probably not work! Try to train
on more data.".
Cheers,
Jörn