Thank you both, Lance and Jeyendran. I am using a post-processing approach along the lines of what you've suggested. I just wanted to be sure there wasn't some better practice that I was overlooking.
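In case the details help anyone else, the rough shape of it is below (an untested
sketch; the URL regex is deliberately naive and the class name is just
illustrative). It tokenizes with tokenizePos() so the character offsets can be
matched against regex-detected URL ranges, and then merges any run of tokens that
falls inside such a range:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.util.Span;

public class UrlRebuilder {

    // Deliberately naive URL pattern, for illustration only.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // Tokenize as usual, then collapse any run of tokens that falls
    // inside a regex-detected URL range back into a single token.
    public static List<String> tokenize(TokenizerME tokenizer, String sentence) {
        List<int[]> urlRanges = new ArrayList<>();
        Matcher m = URL.matcher(sentence);
        while (m.find()) {
            urlRanges.add(new int[] { m.start(), m.end() });
        }

        Span[] spans = tokenizer.tokenizePos(sentence);
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < spans.length; i++) {
            int[] url = rangeContaining(urlRanges, spans[i].getStart());
            if (url == null) {
                tokens.add(spans[i].getCoveredText(sentence).toString());
            } else {
                // Swallow every following token that is still inside
                // the URL, then emit the whole URL as one token.
                while (i + 1 < spans.length && spans[i + 1].getEnd() <= url[1]) {
                    i++;
                }
                tokens.add(sentence.substring(url[0], url[1]));
            }
        }
        return tokens;
    }

    private static int[] rangeContaining(List<int[]> ranges, int pos) {
        for (int[] r : ranges) {
            if (pos >= r[0] && pos < r[1]) {
                return r;
            }
        }
        return null;
    }
}

Working from the Span offsets rather than the token strings avoids having to
guess exactly how the tokenizer split the URL apart.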
Thanks,
Jamey

On Thu, Jul 19, 2012 at 10:42 AM, Jeyendran Balakrishnan <[email protected]> wrote:

> For your particular use case of detecting URLs, another way is to
> preprocess your sentence with a custom URL regex detector, storing the
> detected URLs in a hash map, replacing the detected URLs in the sentence
> with their hashes (or even something like "URL1", "URL2", etc., which
> should not occur naturally in your text), then running it through the
> OpenNLP tokenizer, then postprocessing the resulting tokens to replace
> each placeholder occurrence with the corresponding URL from the hash map.
> The idea is that the replacement values inserted during preprocessing come
> out of the tokenizer as separate tokens, so they can easily be replaced by
> their corresponding URLs extracted by the regex. Since the tokenizer
> operates per sentence, the hash map stays small.
>
> This approach can be used for any regex-based token detector, e.g. for
> emails, decimal numbers, etc.
>
> -Jeyendran
>
>
> -----Original Message-----
> From: Lance Norskog [mailto:[email protected]]
> Sent: Wednesday, July 18, 2012 11:33 PM
> To: [email protected]
> Subject: Re: Augmenting TokenizerME
>
> I would post-process the output, hunt for URLs, and rebuild them.
>
> I believe the statistical models are not fungible. More important, these
> are statistical models and have an error rate. You can do a much better
> job by putting the pieces back together after the tokenizer takes them
> apart.
>
> On Wed, Jul 18, 2012 at 11:34 AM, Jamey Wood <[email protected]> wrote:
> > Is there some way to augment a TokenizerME object without having to
> > start with your own full set of training data? For example, we run
> > into cases where a TokenizerME with the standard "en-token.bin" model
> > performs mostly well for us, but does not do a good job with inline
> > URLs that are common in the text we're using. (In most cases, it will
> > split these up so that "http://whatever.com" becomes something like
> > [ "http", ":", "/", "/", "whatever", "com" ].)
> >
> > Is there some way that we can continue using TokenizerME and the
> > standard "en-token.bin" model, but augment it with our own logic to
> > detect and tokenize URLs? Or would we need to go all the way down to
> > the model-training level and come up with our own replacement for
> > en-token.bin?
> >
> > Thanks,
> > Jamey
>
>
> --
> Lance Norskog
> [email protected]
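For anyone who finds this thread in the archives: a minimal, untested sketch of
Jeyendran's mask-and-restore variant might look like the following. The URL
regex is again deliberately naive, the class name is just illustrative, and the
"URL0"-style placeholders assume (as Jeyendran notes) that such strings never
occur naturally in the input:

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class UrlMaskingTokenizer {

    // Deliberately naive URL pattern, for illustration only.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // Replace each URL with a placeholder, tokenize, then swap the
    // placeholders back for the saved URLs.
    public static String[] tokenize(TokenizerME tokenizer, String sentence) {
        Map<String, String> saved = new HashMap<>();
        StringBuffer masked = new StringBuffer();
        Matcher m = URL.matcher(sentence);
        int n = 0;
        while (m.find()) {
            String key = "URL" + n++;      // e.g. URL0, URL1, ...
            saved.put(key, m.group());
            m.appendReplacement(masked, key);
        }
        m.appendTail(masked);

        String[] tokens = tokenizer.tokenize(masked.toString());
        for (int i = 0; i < tokens.length; i++) {
            String original = saved.get(tokens[i]);
            if (original != null) {
                tokens[i] = original;      // restore the real URL
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(in));
            for (String t : tokenize(tokenizer, "See http://whatever.com for details"))
                System.out.println(t);
        }
    }
}

Since the placeholders travel through the tokenizer as ordinary tokens, the
restore step is a single pass over the output.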
