Sounds like you could use a RegexNameFinder, since these patterns are so well
defined by a set of rules.
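
To illustrate the rule-based idea, here is a minimal sketch using only java.util.regex rather than OpenNLP itself. The class name, the method name, and the pattern are hypothetical, and the pattern assumes North American style numbers (an optional +1, a three-digit area code, then three and four digits). It is a starting point to adapt, not a complete phone grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneRegexSketch {

    // Assumed format: optional "+1"/"1" prefix, area code either in
    // parentheses or followed by a separator, then 3 digits, a separator,
    // and 4 digits. Separators are '-', '.', or a space.
    private static final Pattern PHONE = Pattern.compile(
        "(?:\\+?1[-. ]?)?(?:\\(\\d{3}\\)\\s?|\\d{3}[-. ])\\d{3}[-. ]\\d{4}");

    // Returns every substring of the text that matches the pattern,
    // in order of appearance.
    public static List<String> findPhones(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(
            findPhones("Call (415) 555-0132 or 212-555-0198 today."));
    }
}
```

The same compiled pattern could presumably be handed to OpenNLP's RegexNameFinder, which works on token arrays rather than raw text. Check the constructor and find method against the version you are using.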

> On May 21, 2014, at 7:43 PM, Stuart Robinson <[email protected]> 
> wrote:
> 
> Hi, all. I'm using OpenNLP to recognize phone numbers, for which there
> isn't a pre-existing model. I've been training my own and have gotten
> pretty decent results so far with the simple tokenizer and out-of-the-box
> features but I'd now like to improve the features that it's training on. In
> particular, I'd like to define some token classes that are specific to the
> domain of phone numbers. From what I've read so far (e.g., in Taming Text),
> the out-of-the-box token classes are:
> 
> 1. token is lowercase alphabetic
> 2. token is two digits
> 3. token is four digits
> 4. token contains a number and a letter
> 5. token contains a number and a hyphen
> 6. token contains a number and a backslash
> 7. token contains a number and a comma
> 8. token contains a number and a period
> 9. token contains a number
> 10. token is all caps, single letter
> 11. token is all caps, multiple letters
> 12. token's initial letters are caps
> 13. other
> 
> I'd like to be able to define features like the following:
> 
> a. token is five digits
> b. token is six digits
> c. token is seven digits
> d. token is eight digits
> e. token is greater than eight digits
> etc.
> 
> I know that you can override features when calling NameFinderME.train by
> passing in your own AggregatedFeatureGenerator object, but it's not clear
> how an individual feature generator could use custom token classes.
> Pointers to the appropriate entry point in the code (and any other
> suggestions or advice) would be greatly appreciated.
> 
> Thanks in advance.
> 
> Regards,
> Stuart
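
For the digit-length token classes listed in the question (five digits, six digits, and so on), the classification logic itself can be sketched in plain Java. The class name, method names, and the "phoneclass=" feature prefix below are all hypothetical. In OpenNLP a function like this would presumably be called from a custom feature generator's createFeatures method, which appends feature strings for the token at a given index. That wiring is an assumption and is not shown here.

```java
public class DigitTokenClassSketch {

    // Maps a token to a digit-length class. Tokens that are empty or
    // contain any non-digit character fall into "other".
    public static String tokenClass(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return "other";
            }
        }
        int n = token.length();
        // Lengths above eight share one class, as in the question's item (e).
        return n > 8 ? "digits>8" : "digits" + n;
    }

    // Builds the feature string for the token at the given index.
    // The "phoneclass=" prefix is an illustrative naming choice.
    public static String featureFor(String[] tokens, int index) {
        return "phoneclass=" + tokenClass(tokens[index]);
    }

    public static void main(String[] args) {
        String[] tokens = {"call", "5550132", "ext", "12345"};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + featureFor(tokens, i));
        }
    }
}
```

Emitting one class per digit length keeps the features sparse. Whether that helps more than a single "long digit run" feature is something to check against held-out data.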
