custom token classes for NER model training

Stuart Robinson Wed, 21 May 2014 16:44:22 -0700

Hi, all. I'm using OpenNLP to recognize phone numbers, for which there
isn't a pre-existing model. I've been training my own and have gotten
pretty decent results so far with the simple tokenizer and out-of-the-box
features, but I'd now like to improve the features that it's training on. In
particular, I'd like to define some token classes that are specific to the
domain of phone numbers. From what I've read so far (e.g., in Taming Text),
the out-of-the-box token classes are:


1. token is lowercase alphabetic
2. token is two digits
3. token is four digits
4. token contains a number and a letter
5. token contains a number and a hyphen
6. token contains a number and a backslash
7. token contains a number and a comma
8. token contains a number and a period
9. token contains a number
10. token is all caps, single letter
11. token is all caps, multiple letters
12. token's initial letters are caps
13. other

I'd like to be able to define features like the following:

a. token is five digits
b. token is six digits
c. token is seven digits
d. token is eight digits
e. token is more than eight digits
etc.

I know that you can override features when calling NameFinderME.train by
passing in your own AggregatedFeatureGenerator object, but it's not clear
how an individual feature generator could use custom token classes.
Pointers to the appropriate entry point in the code (and any other
suggestions or advice) would be greatly appreciated.
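
To make the question concrete, here is a rough sketch of the kind of logic
I have in mind. It is not taken from OpenNLP: the class name
PhoneTokenClassFeatures, the tokenClass helper, and the "ptc=" feature
prefix are all made up for illustration. The createFeatures parameter list
is only meant to resemble what I understand a feature generator's
createFeatures method to take (feature list, tokens, index, previous
outcomes); that resemblance is an assumption, and the sketch deliberately
does not depend on the OpenNLP jar.

```java
import java.util.ArrayList;
import java.util.List;

public class PhoneTokenClassFeatures {

    /**
     * Hypothetical phone-specific token class: all-digit tokens are
     * labelled by length ("5d", "7d", ...), anything longer than eight
     * digits becomes "9+d", and everything else is "other".
     */
    static String tokenClass(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return "other"; // not purely ASCII digits
            }
        }
        int n = token.length();
        return n > 8 ? "9+d" : n + "d";
    }

    /**
     * Adds one feature for the token at the given index. The parameter
     * list is assumed, not copied from OpenNLP; previousOutcomes is
     * unused here.
     */
    static void createFeatures(List<String> features, String[] tokens,
                               int index, String[] previousOutcomes) {
        features.add("ptc=" + tokenClass(tokens[index]));
    }

    public static void main(String[] args) {
        String[] tokens = {"Call", "5551234", "or", "18005551234"};
        for (int i = 0; i < tokens.length; i++) {
            List<String> features = new ArrayList<>();
            createFeatures(features, tokens, i, null);
            System.out.println(tokens[i] + " -> " + features);
        }
    }
}
```

Note that for short tokens this would overlap with the existing two-digit
and four-digit classes, so a real version would probably only emit labels
for the lengths listed above.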

Thanks in advance.

Regards,
Stuart
