Hi, Mark. Thanks for your suggestion. My initial approach was to use regular expressions, but I'm looking at social media and there is a lot more variation in the formatting of phone numbers than you would expect (as well as various kinds of obfuscation). So I think a named entity recognizer will ultimately be more robust. Hence my interest in custom token classes.
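For concreteness, here is a rough sketch of the kind of token classes I have in mind. It is plain Java with no OpenNLP dependency, and every name in it (PhoneTokenClasses, tokenClass, addFeature, the "ptc=" prefix, the "gt8d" label) is something I made up for illustration, not part of OpenNLP. My guess is that logic like this could be called from the createFeatures method of a custom feature generator, but I haven't confirmed that this is the right entry point.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): maps a token to a phone-oriented token class. */
final class PhoneTokenClasses {

    private PhoneTokenClasses() {}

    /**
     * Returns "<n>d" for a token made of 1 to 8 ASCII digits, "gt8d" for more
     * than eight digits, and "other" for anything else (including empty tokens).
     */
    static String tokenClass(String token) {
        if (token == null || token.isEmpty()) {
            return "other";
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return "other";
            }
        }
        int n = token.length();
        return n > 8 ? "gt8d" : n + "d";
    }

    /** Appends one feature string for the token at the given index, e.g. "ptc=7d". */
    static void addFeature(List<String> features, String[] tokens, int index) {
        features.add("ptc=" + tokenClass(tokens[index]));
    }

    public static void main(String[] args) {
        String[] tokens = {"call", "5551234", "or", "4155551234"};
        List<String> features = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            addFeature(features, tokens, i);
        }
        System.out.println(features);
    }
}
```

Note that this labels every all-digit token by its length, so it overlaps with the existing two-digit and four-digit classes. Whether that overlap helps or hurts the model is something I would need to test.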
Best,
Stuart

On Wed, May 21, 2014 at 6:09 PM, Mark Giaconia <[email protected]> wrote:

> Sounds like you could use a regexnamefinder since these patterns are so
> well defined with a set of rules.
>
> On May 21, 2014, at 7:43 PM, Stuart Robinson <[email protected]> wrote:
>
> > Hi, all. I'm using OpenNLP to recognize phone numbers, for which there
> > isn't a pre-existing model. I've been training my own and have gotten
> > pretty decent results so far with the simple tokenizer and out-of-the-box
> > features, but I'd now like to improve the features that it's training on.
> > In particular, I'd like to define some token classes that are specific to
> > the domain of phone numbers. From what I've read so far (e.g., in Taming
> > Text), the out-of-the-box token classes are:
> >
> > 1. token is lowercase alphabetic
> > 2. token is two digits
> > 3. token is four digits
> > 4. token contains a number and a letter
> > 5. token contains a number and a hyphen
> > 6. token contains a number and a backslash
> > 7. token contains a number and a comma
> > 8. token contains a number and a period
> > 9. token contains a number
> > 10. token is all caps, single letter
> > 11. token is all caps, multiple letters
> > 12. token's initial letters are caps
> > 13. other
> >
> > I'd like to be able to define features like the following:
> >
> > a. token is five digits
> > b. token is six digits
> > c. token is seven digits
> > d. token is eight digits
> > e. token is more than eight digits
> > etc.
> >
> > I know that you can override features when calling NameFinderME.train by
> > passing in your own AggregatedFeatureGenerator object, but it's not clear
> > how an individual feature generator could use custom token classes.
> > Pointers to the appropriate entry point in the code (and any other
> > suggestions or advice) would be greatly appreciated.
> >
> > Thanks in advance.
> >
> > Regards,
> > Stuart
