On 04/02/2013 06:02 PM, Benson Margulies wrote:
It seems to me to be an invariant that the training and runtime
environments have to agree on the input. In this case, it's a matter
of agreeing on the text normalization (in the Unicode sense) and the
tokenization. I doubt that it is viable to construct a model and
runtime that adapt to some disparate collection of possible
normalizations and tokenizations.

I didn't use "normalization" here in the Unicode sense. Some of
the corpora we use (e.g. Penn Treebank) are unified to use only
certain tokens for quotes, brackets, etc. The same unifications
should be applied in the runtime environment as well.

We currently have no tool in OpenNLP that can do this for the user,
and I propose that we add one.
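
As a rough illustration of what such a tool might do, here is a minimal
sketch in Java. The class name PtbTokenNormalizer and its mapping table
are hypothetical and not an existing OpenNLP API. The sketch only covers
brackets and curly double quotes, using the Penn Treebank token
conventions (-LRB-, -RRB-, ``, '' and so on). Straight double quotes are
left alone because choosing between opening and closing forms needs
context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: rewrites already-tokenized
// input so that it uses Penn Treebank style tokens for brackets and quotes.
final class PtbTokenNormalizer {

    // Assumed mapping from raw tokens to their Penn Treebank equivalents.
    private static final Map<String, String> REPLACEMENTS = new HashMap<>();

    static {
        REPLACEMENTS.put("(", "-LRB-");
        REPLACEMENTS.put(")", "-RRB-");
        REPLACEMENTS.put("[", "-LSB-");
        REPLACEMENTS.put("]", "-RSB-");
        REPLACEMENTS.put("{", "-LCB-");
        REPLACEMENTS.put("}", "-RCB-");
        REPLACEMENTS.put("\u201C", "``"); // left curly double quote
        REPLACEMENTS.put("\u201D", "''"); // right curly double quote
    }

    private PtbTokenNormalizer() {
    }

    // Returns a new array; tokens without a mapping are copied unchanged.
    static String[] normalize(String[] tokens) {
        String[] normalized = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            normalized[i] = REPLACEMENTS.getOrDefault(tokens[i], tokens[i]);
        }
        return normalized;
    }
}
```

The same step would run on the tokenizer output at runtime, and on any
training data that is not already unified, so that both sides see the
same tokens.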

Jörn
