On 06/26/2014 03:56 PM, John Miedema wrote:
First post. I'm working on NER in the domain of literature.

Using standard NER I can pull out People names, authors like "Robert Louis
Stevenson" and character names like "Long John Silver". But of course there
is no distinction between real-life authors and fictional characters.

I've built my first custom model to identify Book Titles. It's just a quick
implementation for test purposes but it works quite well.

I'm considering building a custom model to identify Characters. What I know
now is that the model trainer uses tokens, POS, and proximity of words to
establish features. I can also add dictionaries and such. But I think one
key distinguishing feature of characters (vs People) is the "colorfulness",
or concrete imagery associated with character names:

Long John Silver
Tin Tin
Sherlock Holmes
Gandalf
Nigel Molesworth

By colourful, I mean that the names are more likely to use concrete imagery
(long, tin, mole) or have unique phonetic qualities (Sher Lock, Gan Dalf).
Sure, many characters have common names, but I think I can use these
properties to help identify Character entities. I can come up with a
measure of concreteness, at least.

*My question is, if I knew the concreteness of tokens, is there any way I
can incorporate this measure into my custom model?*

I would prefer to avoid resorting to a dictionary. I think this would work
just like other word attributes, such as frequency, e.g., "home" is a more
frequently used word than "dwelling." Do models ever incorporate attributes
like token frequency? If yes, I could work from that.

*How about the use of phonetics?*


You can define your own feature generators and combine them with the existing feature generators. Right now the features are binary: they are either set or not. If you have a strength or weight, you might be able to translate it into binary features, e.g., by using a mapping function.
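
To make the mapping idea concrete, here is a minimal sketch in Python rather than against the actual feature generator interface. It assumes a hypothetical concreteness lexicon (CONCRETENESS, with made-up scores on a 1.0-5.0 scale) and arbitrary bin boundaries (BOUNDARIES); the feature names such as "conc=bin2" are also just illustrative. Each score falls into exactly one bin, and the bin becomes a feature that is either set or not. In a custom feature generator the same logic would run once per token.

```python
from bisect import bisect_right

# Hypothetical concreteness scores (1.0 = abstract, 5.0 = concrete).
# These values are invented for illustration only.
CONCRETENESS = {"long": 3.2, "john": 2.5, "silver": 4.8, "tin": 4.9}

# Assumed bin boundaries; in practice they would be tuned on held-out data.
BOUNDARIES = [2.0, 3.0, 4.0]


def concreteness_features(tokens, index, scores=CONCRETENESS, boundaries=BOUNDARIES):
    """Return binary feature strings for the token at `index`.

    The continuous score is mapped to a bin number, so the model only
    sees a feature that is present or absent.
    """
    score = scores.get(tokens[index].lower())
    if score is None:
        # Tokens missing from the lexicon get their own feature.
        return ["conc=unknown"]
    # bisect_right counts how many boundaries are <= score.
    bin_id = bisect_right(boundaries, score)
    return ["conc=bin%d" % bin_id]


tokens = ["Long", "John", "Silver"]
features = [concreteness_features(tokens, i) for i in range(len(tokens))]
```

The same binning would work for other numeric attributes, such as token frequency. Fewer bins give the model more training examples per feature, while more bins preserve finer distinctions.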

If you decide to use a dictionary, have a look at Wikipedia; you may be able to link the entities to Wikipedia entries. They probably have some properties that indicate whether an entity is fictional or not. Wikipedia is hard to use directly, but projects like DBpedia make these kinds of lookups possible.

HTH,
Jörn
