I would post-process the output: hunt for the URLs and rebuild them.

I don't believe the statistical models are fungible in that way. More
importantly, these are statistical models and they have an error rate.
You can do a much better job by putting the pieces back together after
the tokenizer takes them apart.

On Wed, Jul 18, 2012 at 11:34 AM, Jamey Wood <[email protected]> wrote:
> Is there some way to augment a TokenizerME object without having to start
> with your own full set of training data?  For example, we run into cases
> where a TokenizerME with the standard "en-token.bin" data performs mostly
> well for us, but does not do a good job with inline URLs that are common in
> the text we're using.  (In most cases, it'll split these up so that
> "http://whatever.com" becomes something like [ "http", ":", "/", "/",
> "whatever", "com" ].)
>
> Is there some way that we can continue using TokenizerME and the standard
> "en-token.bin" model, but augment it with our own logic to detect and
> tokenize URLs?  Or would we need to go all the way down to the model
> training level and come up with our own replacement for en-token.bin?
>
> Thanks,
> Jamey



-- 
Lance Norskog
[email protected]
