Hi, Mark. Thanks for your suggestion. My initial approach was to use
regular expressions, but I'm looking at social media and there is a lot
more variation in the formatting of phone numbers than you would expect (as
well as various kinds of obfuscation). So I think a named entity recognizer
will ultimately be more robust. Hence my interest in custom token classes.
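
To make that concrete, here is a rough sketch of the kind of token class
I have in mind. It is plain Java with no OpenNLP dependency; the class
name PhoneTokenClasses, the label strings, and the "ptc=" feature prefix
are all made up for illustration. Presumably something like features()
would be called from a custom feature generator passed to
NameFinderME.train, but that wiring is exactly the part I haven't worked
out, so it is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not part of OpenNLP. All names and labels are hypothetical.
class PhoneTokenClasses {

    // Maps a token to a coarse class label based on how many digits it has.
    static String tokenClass(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        int digits = 0;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c >= '0' && c <= '9') {
                digits++;
            }
        }
        if (digits == token.length()) {
            // All-digit token: distinguish exact lengths up to eight digits.
            return digits > 8 ? "digits>8" : "digits=" + digits;
        }
        if (digits > 0) {
            // Some digits plus letters or punctuation, e.g. "x12" or "555-0123".
            return "mixed-with-digits";
        }
        return "other";
    }

    // Builds the feature strings for the token at the given index.
    static List<String> features(String[] tokens, int index) {
        List<String> feats = new ArrayList<>();
        feats.add("ptc=" + tokenClass(tokens[index]));
        return feats;
    }

    public static void main(String[] args) {
        String[] tokens = {"call", "55501", "or", "4155550123", "x12"};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + features(tokens, i));
        }
    }
}
```

The idea is that a five-digit token and a seven-digit token end up with
different features, instead of both falling into a generic "contains a
number" class.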

Best,
Stuart


On Wed, May 21, 2014 at 6:09 PM, Mark Giaconia <[email protected]> wrote:

>
>
> Sounds like you could use a RegexNameFinder, since these patterns are so
> well defined by a set of rules.
>
> > On May 21, 2014, at 7:43 PM, Stuart Robinson <[email protected]> wrote:
> >
> > Hi, all. I'm using OpenNLP to recognize phone numbers, for which there
> > isn't a pre-existing model. I've been training my own and have gotten
> > pretty decent results so far with the simple tokenizer and out-of-the-box
> > features but I'd now like to improve the features that it's training on. In
> > particular, I'd like to define some token classes that are specific to the
> > domain of phone numbers. From what I've read so far (e.g., in Taming Text),
> > the out-of-the-box token classes are:
> >
> > 1. token is lowercase alphabetic
> > 2. token is two digits
> > 3. token is four digits
> > 4. token contains a number and a letter
> > 5. token contains a number and a hyphen
> > 6. token contains a number and a backslash
> > 7. token contains a number and a comma
> > 8. token contains a number and a period
> > 9. token contains a number
> > 10. token is all caps, single letter
> > 11. token is all caps, multiple letters
> > 12. token's initial letters are caps
> > 13. other
> >
> > I'd like to be able to define features like the following:
> >
> > a. token is five digits
> > b. token is six digits
> > c. token is seven digits
> > d. token is eight digits
> > e. token is more than eight digits
> > etc.
> >
> > I know that you can override features when calling NameFinderME.train by
> > passing in your own AggregatedFeatureGenerator object, but it's not clear
> > how an individual feature generator could use custom token classes.
> > Pointers to the appropriate entry point in the code (and any other
> > suggestions or advice) would be greatly appreciated.
> >
> > Thanks in advance.
> >
> > Regards,
> > Stuart
>
