Sounds like you could use a RegexNameFinder here, since phone numbers follow well-defined patterns that a small set of rules can capture.
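
Here is a rough sketch of the rule-based route, using only java.util.regex so it runs without OpenNLP on the classpath. The class name, the method name, and the pattern itself are my own assumptions (the pattern only covers a few North American formats), so treat it as a starting point rather than a finished rule set. As far as I recall, RegexNameFinder in opennlp.tools.namefind is constructed from compiled Pattern objects, so a pattern along these lines should carry over, but please check that against the version you are using.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneRegexSketch {
    // Hypothetical pattern: matches 555-123-4567, 555.123.4567,
    // 555 123 4567 and (555) 123-4567. The lookarounds stop it from
    // matching inside a longer run of digits.
    static final Pattern PHONE = Pattern.compile(
        "(?<!\\d)(?:\\(\\d{3}\\)\\s?|\\d{3}[-.\\s])\\d{3}[-.\\s]\\d{4}(?!\\d)");

    // Returns every substring of text that matches PHONE, in order.
    static List<String> findPhoneNumbers(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [(555) 123-4567, 555-987-6543]
        System.out.println(
            findPhoneNumbers("Call (555) 123-4567 or 555-987-6543 after 5pm."));
    }
}
```

The advantage over a trained model is that there is nothing to annotate or retrain: when you hit a format that is missed, you add or adjust a pattern.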
> On May 21, 2014, at 7:43 PM, Stuart Robinson <[email protected]> wrote:
>
> Hi, all. I'm using OpenNLP to recognize phone numbers, for which there
> isn't a pre-existing model. I've been training my own and have gotten
> pretty decent results so far with the simple tokenizer and out-of-the-box
> features but I'd now like to improve the features that it's training on. In
> particular, I'd like to define some token classes that are specific to the
> domain of phone numbers. From what I've read so far (e.g., in Taming Text),
> the out-of-the-box token classes are:
>
> 1. token is lowercase alphabetic
> 2. token is two digits
> 3. token is four digits
> 4. token contains a number and a letter
> 5. token contains a number and a hyphen
> 6. token contains a number and a backslash
> 7. token contains a number and a comma
> 8. token contains a number and a period
> 9. token contains a number
> 10. token is all caps, single letter
> 11. token is all caps, multiple letters
> 12. token's initial letters are caps
> 13. other
>
> I'd like to be able to define features like the following:
>
> a. token is five digits
> b. token is six digits
> c. token is seven digits
> d. token is eight digits
> e. token is greater than eight digits
> etc.
>
> I know that you can override features when calling NameFinderME.train by
> passing in your own AggregatedFeatureGenerator object, but it's not clear
> how an individual feature generator could use custom token classes.
> Pointers to the appropriate entry point in the code (and any other
> suggestions or advice) would be greatly appreciated.
>
> Thanks in advance.
>
> Regards,
> Stuart
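
If you do stay with the trained model, the digit-length classes (a) through (e) come down to a small function from a token to a class label. The sketch below is plain Java with no OpenNLP dependency; the class name, the method name, and the label strings ("digits=5", "digits>8", "other") are all hypothetical choices of mine. My understanding is that a custom feature generator would call something like this from its createFeatures method and add the returned label to the feature list, but I have not verified that wiring against a specific OpenNLP release.

```java
public class DigitTokenClass {
    // Maps a token to a digit-length class label (hypothetical labels):
    //   all ASCII digits, length 1..8 -> "digits=<length>"
    //   all ASCII digits, length > 8  -> "digits>8"
    //   anything else, or empty       -> "other"
    static String digitLengthClass(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return "other";
            }
        }
        int n = token.length();
        return n > 8 ? "digits>8" : "digits=" + n;
    }

    public static void main(String[] args) {
        System.out.println(digitLengthClass("55512"));      // digits=5
        System.out.println(digitLengthClass("5551234567")); // digits>8
        System.out.println(digitLengthClass("555-1234"));   // other
    }
}
```

Emitting one label per length keeps each class a separate feature, which matches the list in the original message; whether that granularity actually helps is something only an evaluation on your data can tell you.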
