Is there some way to augment a TokenizerME object without having to start with a full set of our own training data? For example, we run into cases where a TokenizerME using the standard "en-token.bin" model mostly performs well for us, but it does a poor job with the inline URLs that are common in our text. (In most cases it splits them up, so that "http://whatever.com" becomes something like [ "http", ":", "/", "/", "whatever", "com" ].)
Is there some way we can keep using TokenizerME and the standard "en-token.bin" model but augment it with our own logic to detect and tokenize URLs, perhaps something like the wrapper sketched below? Or would we need to go all the way down to the model-training level and build our own replacement for en-token.bin?
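To clarify what I mean by "augment", here's a rough, untested sketch of the kind of wrapper I have in mind: match URLs with a simple regex first, keep each match as a single token, and run TokenizerME only on the text in between. The UrlAwareTokenizer class and the URL_PATTERN regex are just placeholders I made up, not anything from the OpenNLP API, and the regex is deliberately naive.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Hypothetical wrapper: URLs are matched up front and emitted as
// single tokens; the text between them goes through TokenizerME
// with the stock en-token.bin model.
public class UrlAwareTokenizer {

    // Placeholder pattern, not a complete URL grammar.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://\\S+");

    private final TokenizerME tokenizer;

    public UrlAwareTokenizer(TokenizerME tokenizer) {
        this.tokenizer = tokenizer;
    }

    public String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        int last = 0;
        while (m.find()) {
            // Tokenize the stretch before the URL with the ME model.
            String before = text.substring(last, m.start());
            if (!before.isEmpty()) {
                tokens.addAll(Arrays.asList(tokenizer.tokenize(before)));
            }
            // Keep the URL itself as one token.
            tokens.add(m.group());
            last = m.end();
        }
        // Tokenize whatever follows the final URL.
        String rest = text.substring(last);
        if (!rest.isEmpty()) {
            tokens.addAll(Arrays.asList(tokenizer.tokenize(rest)));
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerME me = new TokenizerME(new TokenizerModel(in));
            UrlAwareTokenizer t = new UrlAwareTokenizer(me);
            System.out.println(Arrays.toString(
                    t.tokenize("See http://whatever.com for details.")));
        }
    }
}
```

Is something along these lines the intended approach, or is there a hook in the library that's better suited to this?

Thanks, Jamey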
