Is there some way to augment a TokenizerME object without having to start
with your own full set of training data?  For example, we run into cases
where a TokenizerME with the standard "en-token.bin" data performs mostly
well for us, but does not do a good job with inline URLs that are common in
the text we're using.  (In most cases, it'll split these up so that
"http://whatever.com" becomes something like [ "http", ":", "/", "/",
"whatever", "com" ].)

Is there some way that we can continue using TokenizerME and the standard
"en-token.bin" model, but augment it with our own logic to detect and
tokenize URLs?  Or would we need to go all the way down to the model
training level and come up with our own replacement for en-token.bin?
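For concreteness, here's roughly the kind of wrapper we had in mind (an
untested sketch; the URL regex and the UrlAwareTokenizer class name are
just placeholders we made up, not anything from OpenNLP itself).  It
pre-scans the text for URL-looking spans, emits each one as a single
token, and hands only the in-between text to the TokenizerME:

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class UrlAwareTokenizer {

    // Deliberately naive URL pattern, purely for illustration.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://\\S+");

    private final Tokenizer delegate;

    public UrlAwareTokenizer(Tokenizer delegate) {
        this.delegate = delegate;
    }

    public String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        int last = 0;
        while (m.find()) {
            // Let the statistical model handle the text before the URL.
            for (String t : delegate.tokenize(text.substring(last, m.start()))) {
                tokens.add(t);
            }
            // Emit the URL itself as one token.
            tokens.add(m.group());
            last = m.end();
        }
        // Tokenize whatever remains after the last URL.
        for (String t : delegate.tokenize(text.substring(last))) {
            tokens.add(t);
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerModel model = new TokenizerModel(in);
            UrlAwareTokenizer tokenizer =
                    new UrlAwareTokenizer(new TokenizerME(model));
            for (String t : tokenizer.tokenize(
                    "See http://whatever.com for details.")) {
                System.out.println(t);
            }
        }
    }
}

Is a wrapper like this the recommended approach, or is there a hook
inside TokenizerME itself that we're missing?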

Thanks,
Jamey
