For your particular use case of detecting URLs, another approach is to
preprocess your sentence with a custom URL regex detector: store the detected
URLs in a hash map, replace each detected URL in the sentence with its hash
(or even something like "URL1", "URL2", etc., which should not occur naturally
in your text), run the sentence through the OpenNLP tokenizer, and then
postprocess the resulting tokens, replacing each placeholder occurrence with
the corresponding URL from the hash map. The idea is that the replacement
values inserted during preprocessing come out of the tokenizer as single
tokens, so they can easily be swapped back for the URLs extracted by the
regex. Since the tokenizer operates per sentence, the hash map stays small.

This approach works for any regex-based token detector, e.g. for emails,
decimal numbers, etc. A rough sketch of the idea is below.
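
Here is a minimal Java sketch of that flow against OpenNLP's TokenizerME and
the stock en-token.bin model. The URL regex, the "URL<n>" placeholder scheme,
and the class/method names are just illustrative assumptions on my part, not
anything built into OpenNLP; use whatever placeholder cannot occur naturally
in your text:

// Sketch only: preprocess URLs into placeholders, tokenize, then restore.
// Assumes en-token.bin is on disk and placeholders like "URL1" never occur
// naturally in the input text.
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class UrlAwareTokenizer {

    // Deliberately simple URL regex; real URL detection needs more care.
    private static final Pattern URL_PATTERN = Pattern.compile("https?://\\S+");

    public static String[] tokenize(TokenizerME tokenizer, String sentence) {
        Map<String, String> placeholders = new HashMap<>();
        Matcher m = URL_PATTERN.matcher(sentence);
        StringBuffer sb = new StringBuffer();
        int i = 1;

        // 1. Replace each detected URL with a placeholder, remembering the mapping.
        while (m.find()) {
            String key = "URL" + i++;
            placeholders.put(key, m.group());
            m.appendReplacement(sb, key);
        }
        m.appendTail(sb);

        // 2. Run the placeholder-bearing sentence through the statistical tokenizer.
        String[] tokens = tokenizer.tokenize(sb.toString());

        // 3. Swap each placeholder token back for its original URL.
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(placeholders.getOrDefault(token, token));
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(in));
            String[] tokens = tokenize(tokenizer,
                    "See http://whatever.com for details.");
            for (String t : tokens) {
                System.out.println(t);
            }
        }
    }
}

One thing to watch: a naive \S+ regex will swallow trailing punctuation (e.g.
a sentence-final period right after the URL), so you may want to trim that off
before storing the URL in the map.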

-Jeyendran


-----Original Message-----
From: Lance Norskog [mailto:[email protected]] 
Sent: Wednesday, July 18, 2012 11:33 PM
To: [email protected]
Subject: Re: Augmenting TokenizerME

I would post-process the output, hunt for the URLs, and rebuild them.

I believe the statistical models are not fungible. More importantly, these are
statistical models and have an error rate. You can do a much better job by
putting together the pieces after the tokenizer takes them apart.

On Wed, Jul 18, 2012 at 11:34 AM, Jamey Wood <[email protected]> wrote:
> Is there some way to augment a TokenizerME object without having to 
> start with your own full set of training data?  For example, we run 
> into cases where a TokenizerME with the standard "en-token.bin" data 
> performs mostly well for us, but does not do a good job with inline 
> URLs that are common in the text we're using.  (In most cases, it'll split 
> these up so that "http://whatever.com" becomes something like
> [ "http", ":", "/", "/", "whatever", "com" ].)
>
> Is there some way that we can continue using TokenizerME and the 
> standard "en-token.bin" model, but augment it with our own logic to 
> detect and tokenize URLs?  Or would we need to go all the way down to 
> the model training level and come up with our own replacement for 
> en-token.bin?
>
> Thanks,
> Jamey



--
Lance Norskog
[email protected]
