The regex trick is nice!
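
For anyone landing on this thread later, here is a rough sketch of that round trip with TokenizerME. The URL regex and the "URL1"/"URL2" placeholder names are just illustrative, as Jeyendran said, anything guaranteed not to occur naturally in your text will do, and a production URL pattern would need to be stricter:

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class UrlAwareTokenizer {

    // Deliberately simple URL pattern; illustrative only.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    public static void main(String[] args) throws Exception {
        InputStream modelIn = new FileInputStream("en-token.bin");
        TokenizerME tokenizer = new TokenizerME(new TokenizerModel(modelIn));
        modelIn.close();

        String sentence = "See http://whatever.com for details.";

        // 1. Replace each URL with a placeholder and remember the mapping.
        Map<String, String> placeholders = new HashMap<String, String>();
        Matcher m = URL.matcher(sentence);
        StringBuffer masked = new StringBuffer();
        int i = 0;
        while (m.find()) {
            String key = "URL" + (++i);   // assumes "URL1", "URL2", ... never occur in the text
            placeholders.put(key, m.group());
            m.appendReplacement(masked, key);
        }
        m.appendTail(masked);

        // 2. Tokenize the masked sentence with the stock en-token.bin model.
        String[] tokens = tokenizer.tokenize(masked.toString());

        // 3. Swap each placeholder token back for its original URL.
        List<String> restored = new ArrayList<String>();
        for (String t : tokens) {
            restored.add(placeholders.containsKey(t) ? placeholders.get(t) : t);
        }

        // expected: [See, http://whatever.com, for, details, .]
        System.out.println(restored);
    }
}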

On Thu, Jul 19, 2012 at 9:52 AM, Jamey Wood <[email protected]> wrote:
> Thank you both, Lance and Jeyendran.  I am using a post-processing approach
> along the lines of what you've suggested.  I just wanted to be sure there
> wasn't some better practice that I was overlooking.
>
> Thanks,
> Jamey
>
> On Thu, Jul 19, 2012 at 10:42 AM, Jeyendran Balakrishnan <
> [email protected]> wrote:
>
>> For your particular use case of detecting URLs, another way is to
>> preprocess your sentence with a custom URL regex detector: store the
>> detected URLs in a hash map, replace them in the sentence with their
>> hash keys (or even something like "URL1", "URL2", etc., which should
>> not occur naturally in your text), run the sentence through the OpenNLP
>> tokenizer, then postprocess the resulting tokens to replace each
>> placeholder occurrence with the corresponding URL from the hash map.
>> The idea is that the placeholder values inserted during preprocessing
>> come out of the tokenizer as separate tokens, so they can easily be
>> swapped back for the URLs extracted by the regex. Since the tokenizer
>> operates per-sentence, the hash map stays small.
>> This approach works for any regex-based token detector, e.g. for
>> emails, decimal numbers, etc.
>>
>> -Jeyendran
>>
>>
>> -----Original Message-----
>> From: Lance Norskog [mailto:[email protected]]
>> Sent: Wednesday, July 18, 2012 11:33 PM
>> To: [email protected]
>> Subject: Re: Augmenting TokenizerME
>>
>> I would post-process the output, hunt for URLs, and rebuild them.
>>
>> I believe the statistical models are not fungible. More importantly, these
>> are statistical models and have an error rate. You can do a much better job
>> by putting the pieces back together after the tokenizer takes them apart.
>>
>> On Wed, Jul 18, 2012 at 11:34 AM, Jamey Wood <[email protected]> wrote:
>> > Is there some way to augment a TokenizerME object without having to
>> > start with your own full set of training data?  For example, we run
>> > into cases where a TokenizerME with the standard "en-token.bin" data
>> > performs mostly well for us, but does not do a good job with inline
>> > URLs that are common in the text we're using.  (In most cases, it'll
>> > split these up so that "http://whatever.com" becomes something like
>> > [ "http", ":", "/", "/", "whatever", "com" ].)
>> >
>> > Is there some way that we can continue using TokenizerME and the
>> > standard "en-token.bin" model, but augment it with our own logic to
>> > detect and tokenize URLs?  Or would we need to go all the way down to
>> > the model training level and come up with our own replacement for
>> > en-token.bin?
>> >
>> > Thanks,
>> > Jamey
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>
>>



-- 
Lance Norskog
[email protected]
