I would post-process the output: hunt for the URL pieces and rebuild them. I don't believe the statistical models can simply be swapped out or augmented that way. More important, these are statistical models and have an error rate. You can do a much better job by putting the pieces back together after the tokenizer takes them apart.
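Here is a minimal sketch of that kind of post-processing. It is only an illustration, not a tested implementation: the UrlMergingTokenizer class name and the URL regex are my own inventions, and the regex is deliberately crude. The idea is to use tokenizePos() to get character offsets, match URLs against the raw text with a regex, and emit one token for each run of model tokens that falls inside a URL match:

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class UrlMergingTokenizer {

    // Deliberately simple URL pattern; real-world URLs need a stricter expression.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    public static String[] tokenize(TokenizerME tokenizer, String text) {
        // Character offsets of the model's tokens in the raw text.
        Span[] spans = tokenizer.tokenizePos(text);
        List<String> tokens = new ArrayList<>();

        // Character ranges covered by URLs in the raw text.
        List<Span> urlSpans = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urlSpans.add(new Span(m.start(), m.end()));
        }

        int i = 0;
        while (i < spans.length) {
            Span urlSpan = containing(urlSpans, spans[i]);
            if (urlSpan != null) {
                // Emit the whole URL once, skipping every model token inside it.
                tokens.add(urlSpan.getCoveredText(text).toString());
                while (i < spans.length && containing(urlSpans, spans[i]) == urlSpan) {
                    i++;
                }
            } else {
                tokens.add(spans[i].getCoveredText(text).toString());
                i++;
            }
        }
        return tokens.toArray(new String[0]);
    }

    // Returns the URL span that fully contains the token span, or null.
    private static Span containing(List<Span> urlSpans, Span token) {
        for (Span u : urlSpans) {
            if (token.getStart() >= u.getStart() && token.getEnd() <= u.getEnd()) {
                return u;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(in));
            String text = "See http://whatever.com for details.";
            for (String t : tokenize(tokenizer, text)) {
                System.out.println(t);
            }
        }
    }
}

Because the merge works on character offsets rather than on the token strings, the standard en-token.bin model stays untouched; you only override its output where your own URL detector fires.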
On Wed, Jul 18, 2012 at 11:34 AM, Jamey Wood <[email protected]> wrote:
> Is there some way to augment a TokenizerME object without having to start
> with your own full set of training data? For example, we run into cases
> where a TokenizerME with the standard "en-token.bin" data performs mostly
> well for us, but does not do a good job with inline URLs that are common in
> the text we're using. (In most cases, it'll split these up so that
> "http://whatever.com" becomes something like [ "http", ":", "/", "/",
> "whatever", "com" ].)
>
> Is there some way that we can continue using TokenizerME and the standard
> "en-token.bin" model, but augment it with our own logic to detect and
> tokenize URLs? Or would we need to go all the way down to the model
> training level and come up with our own replacement for en-token.bin?
>
> Thanks,
> Jamey

--
Lance Norskog
[email protected]
