Hi, all. I'm using OpenNLP to recognize phone numbers, for which there isn't a pre-existing model. I've been training my own and have gotten pretty decent results so far with the simple tokenizer and the out-of-the-box features, but I'd now like to improve the features that it trains on. In particular, I'd like to define some token classes that are specific to the domain of phone numbers. From what I've read so far (e.g., in Taming Text), the out-of-the-box token classes are:

1. token is lowercase alphabetic
2. token is two digits
3. token is four digits
4. token contains a number and a letter
5. token contains a number and a hyphen
6. token contains a number and a backslash
7. token contains a number and a comma
8. token contains a number and a period
9. token contains a number
10. token is all caps, single letter
11. token is all caps, multiple letters
12. token's initial letter is capitalized
13. other

I'd like to be able to define features like the following:

a. token is five digits
b. token is six digits
c. token is seven digits
d. token is eight digits
e. token is more than eight digits
etc.

I know that you can override the features when calling NameFinderME.train by passing in your own AggregatedFeatureGenerator object, but it's not clear how an individual feature generator could use custom token classes. Pointers to the appropriate entry point in the code (and any other suggestions or advice) would be greatly appreciated.

Thanks in advance.

Regards,
Stuart
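
P.S. To make classes a through e concrete, here is a rough sketch of the token classing I have in mind. Everything in it is made up for illustration: the class name PhoneTokenClass, the labels ("5d", "gt8d", and so on), and the "ptc=" feature prefix are my own inventions, not anything from OpenNLP. My guess is that this logic would be called from the createFeatures(List<String>, String[], int, String[]) method of a custom AdaptiveFeatureGenerator, but that is an assumption on my part and exactly the thing I'd like confirmed.

```java
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): maps a token to a digit-length class. */
final class PhoneTokenClass {

    private PhoneTokenClass() {}

    /**
     * Returns a label such as "5d" for a token made only of ASCII digits,
     * "gt8d" when it has more than eight digits, and "other" otherwise.
     */
    static String tokenClass(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return "other"; // any non-digit character disqualifies the token
            }
        }
        int digits = token.length();
        return digits > 8 ? "gt8d" : digits + "d";
    }

    /**
     * Mirrors the shape of a feature generator callback: appends one
     * feature string for tokens[index] to the supplied list.
     */
    static void createFeatures(List<String> features, String[] tokens, int index) {
        features.add("ptc=" + tokenClass(tokens[index]));
    }
}
```

With this, a token such as "5551212" would yield the feature "ptc=7d", while "555-1212" would fall through to "ptc=other" unless the tokenizer splits it on the hyphen first.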
