Peter,

The tokenizer's job is to learn how to split tokens that are not already separated by whitespace. The training file can contain ordinary lines without <SPLIT> tags, but the trainer needs lines with <SPLIT> tags to learn from.

For example:  This can<SPLIT>'t be a good thing<SPLIT>.
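
To make the format concrete, here is a minimal sketch of how an annotated training line maps to tokens. It is plain Java and does not use the OpenNLP API; the class name SplitFormat and the method tokens are made up for illustration, and the only assumption is the one stated above: whitespace and <SPLIT> both mark a token boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: shows which tokens a
// <SPLIT>-annotated training line describes.
final class SplitFormat {
    static final String SPLIT_TAG = "<SPLIT>";

    /** Returns the tokens described by one training line. */
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        // First cut on whitespace, then cut each chunk on the <SPLIT> tag.
        for (String chunk : line.trim().split("\\s+")) {
            for (String token : chunk.split(Pattern.quote(SPLIT_TAG))) {
                if (!token.isEmpty()) {
                    result.add(token);
                }
            }
        }
        return result;
    }
}
```

Calling SplitFormat.tokens("This can<SPLIT>'t be a good thing<SPLIT>.") yields the eight tokens This, can, 't, be, a, good, thing and the final period.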

Once trained, the tokenizer splits sentences into complete tokens, like this:
  It was n't long before someone opened the book .
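
If it helps, a quick sanity check before training is to count how many lines of the training data actually contain a <SPLIT> tag. The sketch below is plain Java with no OpenNLP calls; SplitCheck and linesWithSplit are hypothetical names. That a count of zero leads to the failure quoted below is based only on what is reported in this thread, not on anything I have verified in the trainer code.

```java
import java.util.List;

// Hypothetical pre-training check, not part of OpenNLP.
final class SplitCheck {
    static final String SPLIT_TAG = "<SPLIT>";

    /** Counts the lines that contain at least one <SPLIT> tag. */
    static long linesWithSplit(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains(SPLIT_TAG))
                .count();
    }
}
```

The lines can be loaded with java.nio.file.Files.readAllLines on the training file; if the count comes back as 0, the file gives the trainer no split examples to learn from.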

On 1/6/2014 6:00 PM, Peter Thygesen wrote:
Just observed that training a token model with a text file without any
<SPLIT> tags will fail with the following error message:

Performing 100 iterations.
   1:  ... loglikelihood=0.0 1.0
   2:  ... loglikelihood=0.0 1.0
Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:476)
    at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:63)
    at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:253)
    at opennlp.tools.cmdline.tokenizer.TokenizerTrainerTool.run(TokenizerTrainerTool.java:89)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)
Caused by: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.tokenize.TokenizerModel.validateArtifactMap(TokenizerModel.java:155)
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:474)
    ... 4 more

When I added just one <SPLIT> tag, it worked. But the documentation states:
"... The OpenNLP format contains one sentence per line. Tokens are either
separated by a *whitespace* *or* by a special <SPLIT> tag. The following
sample shows the sample from above in the correct format."

I read that as saying the file does not necessarily have to contain <SPLIT> tags,
just whitespace.

brgds,
Peter Thygesen

