First post. I'm working on NER in the domain of literature.

Using standard NER I can pull out People names: authors like "Robert Louis
Stevenson" and characters like "Long John Silver". But of course the model
makes no distinction between real-life authors and fictional characters.

I've built my first custom model to identify Book Titles. It's just a quick
implementation for test purposes but it works quite well.

I'm considering building a custom model to identify Characters. What I know
now is that the model trainer uses tokens, POS, and proximity of words to
establish features. I can also add dictionaries and such. But I think one
key distinguishing feature of characters (vs People) is the "colorfulness",
or concrete imagery associated with character names:

Long John Silver
Tin Tin
Sherlock Holmes
Gandalf
Nigel Molesworth

By colorful, I mean that the names are more likely to use concrete imagery
(long, tin, mole) or have unique phonetic qualities (Sher Lock, Gan Dalf).
Sure, many characters have common names, but I think I can use these
properties to help identify Character entities. I can come up with a
measure of concreteness, at least.

*My question is, if I knew the concreteness of tokens, is there any way I
can incorporate this measure into my custom model?*

I would prefer to avoid resorting to a dictionary. I think this would work
just like other word attributes, such as frequency, e.g., "home" is a more
frequently used word than "dwelling." Do models ever incorporate attributes
like token frequency? If yes, I could work from that.
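
To show what I mean, here is a rough Python sketch, not tied to any
particular trainer. It assumes the trainer accepts string-valued features
per token, so the numeric score is bucketed into a few bins. The scores in
CONCRETENESS are made-up placeholders, the bin thresholds are arbitrary, and
the function names are hypothetical. The lookup is a word-to-score table,
not a list of character names.

```python
# Hypothetical concreteness scores on a 1-5 scale; placeholder values only.
CONCRETENESS = {
    "long": 3.2,
    "tin": 4.8,
    "silver": 4.7,
    "home": 4.5,
    "dwelling": 4.0,
}


def concreteness_bucket(token, scores=CONCRETENESS):
    """Map a token to a coarse, string-valued concreteness feature."""
    score = scores.get(token.lower())
    if score is None:
        return "conc=unknown"
    if score >= 4.0:  # arbitrary threshold, would need tuning
        return "conc=high"
    if score >= 2.5:  # arbitrary threshold, would need tuning
        return "conc=mid"
    return "conc=low"


def token_features(tokens, index):
    """Features for tokens[index], plus the neighbours' concreteness bins."""
    feats = ["w=" + tokens[index].lower(), concreteness_bucket(tokens[index])]
    if index > 0:
        feats.append("prev_" + concreteness_bucket(tokens[index - 1]))
    if index + 1 < len(tokens):
        feats.append("next_" + concreteness_bucket(tokens[index + 1]))
    return feats


print(token_features(["Long", "John", "Silver"], 1))
```

Token frequency could be bucketed the same way, if the trainer lets me add
feature strings like these alongside its built-in ones.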

*Similarly, could phonetic properties of a token be used as features?*
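
As a very crude stand-in for phonetics, I could imagine something like the
Python sketch below, which reduces a token to its consonant/vowel pattern
based on spelling alone. This is only an approximation of sound, treating
"y" as a vowel is my own assumption, and the function and feature names are
hypothetical.

```python
VOWELS = set("aeiouy")  # assumption: treat "y" as a vowel


def cv_shape(token):
    """Collapse a token to its consonant/vowel pattern from spelling alone."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in token.lower()
        if ch.isalpha()
    )


def phonetic_features(token):
    """String-valued features: the full pattern and its last two symbols."""
    shape = cv_shape(token)
    return ["cv=" + shape, "cv_end=" + shape[-2:]]


print(phonetic_features("Gandalf"))
```

A real phonetic encoding would presumably work better, but the feature
would be passed to the trainer in the same way.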

Any suggestions are appreciated.

Thanks, John




-- 
_________________________________________
johnmiedema.com
