First post. I'm working on NER in the domain of literature. Using standard NER I can pull out person names: authors like "Robert Louis Stevenson" and characters like "Long John Silver". But of course it makes no distinction between real-life authors and fictional characters.
I've built my first custom model to identify book titles. It's just a quick implementation for test purposes, but it works quite well. I'm now considering building a custom model to identify characters.

What I know so far is that the model trainer uses tokens, POS tags, and the proximity of words to establish features. I can also add dictionaries and such. But I think one key distinguishing feature of characters (vs. people) is the "colorfulness", or concrete imagery, associated with character names:

Long John Silver
Tin Tin
Sherlock Holmes
Gandalf
Nigel Molesworth

By colorful, I mean that the names are more likely to use concrete imagery (long, tin, mole) or to have unique phonetic qualities (Sher Lock, Gan Dalf). Sure, many characters have common names, but I think I can use these properties to help identify character entities. I can come up with a measure of concreteness, at least.

*My question is, if I knew the concreteness of tokens, is there any way I can incorporate this measure into my custom model?* I would prefer to avoid resorting to a dictionary. I think this would work just like other word attributes, such as frequency, e.g., "home" is a more frequently used word than "dwelling." Do models ever incorporate attributes like token frequency? If yes, I could work from that.

*How about the use of phonetics?*

Any suggestions are appreciated.

Thanks,
John

--
_________________________________________
johnmiedema.com
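
P.S. To make the question more concrete, here is a rough sketch of what I have in mind. It is not tied to any particular toolkit. It assumes the trainer accepts discrete string features per token, so a numeric attribute such as concreteness or frequency is bucketed into a small set of labels first. The score tables, the 1.0 to 5.0 concreteness scale, the bucket counts, and all function names are hypothetical and invented for illustration.

```python
import math

# Hypothetical concreteness scores on an assumed 1.0 (abstract) to 5.0 (concrete)
# scale. The values are invented for illustration only.
CONCRETENESS = {"long": 3.2, "tin": 4.8, "silver": 4.7, "mole": 4.9, "john": 2.5}

# Hypothetical corpus counts, also invented for illustration.
FREQUENCY = {"home": 120000, "dwelling": 900}


def concreteness_feature(token, scores=CONCRETENESS, n_buckets=4, lo=1.0, hi=5.0):
    """Map a numeric concreteness score to a discrete feature string."""
    score = scores.get(token.lower())
    if score is None:
        return "conc=unk"  # token has no known score
    clipped = min(max(score, lo), hi)
    # Scale into [0, n_buckets) and keep the top of the range in the last bucket.
    bucket = min(int((clipped - lo) / (hi - lo) * n_buckets), n_buckets - 1)
    return f"conc={bucket}"


def frequency_feature(token, counts=FREQUENCY):
    """Bucket a raw corpus count by its order of magnitude."""
    count = counts.get(token.lower(), 0)
    if count == 0:
        return "freq=unk"
    return f"freq={int(math.log10(count))}"


def token_features(tokens, index):
    """Collect string features for tokens[index], plus the previous token's bucket."""
    token = tokens[index]
    feats = [f"w={token.lower()}", concreteness_feature(token), frequency_feature(token)]
    if index > 0:
        feats.append("prev_" + concreteness_feature(tokens[index - 1]))
    return feats


print(token_features(["Long", "John", "Silver"], 2))
```

If something like this is possible, I imagine a phonetic code for each token could be emitted as one more string feature in the same way, but that part is a guess.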
