On 06/26/2014 03:56 PM, John Miedema wrote:
> First post. I'm working on NER in the domain of literature.
> Using standard NER I can pull out People names, authors like "Robert Louis
> Stevenson" and character names like "Long John Silver". But of course there
> is no distinction between real-life authors and fictional characters.
> I've built my first custom model to identify Book Titles. It's just a quick
> implementation for test purposes but it works quite well.
> I'm considering building a custom model to identify Characters. What I know
> now is that the model trainer uses tokens, POS, and proximity of words to
> establish features. I can also add dictionaries and such. But I think one
> key distinguishing feature of characters (vs People) is the "colorfulness",
> or concrete imagery associated with character names:
> Long John Silver
> Tin Tin
> Sherlock Holmes
> Gandalf
> Nigel Molesworth
> By colourful, I mean that the names are more likely to use concrete imagery
> (long, tin, mole) or have unique phonetic qualities (Sher Lock, Gan Dalf).
> Sure, many characters have common names, but I think I can use these
> properties to help identify Character entities. I can come up with a
> measure of concreteness, at least.
> *My question is, if I knew the concreteness of tokens, is there any way I
> can incorporate this measure into my custom model?*
> I would prefer to avoid resorting to a dictionary. I think this would work
> just like other word attributes, such as frequency, e.g., "home" is a more
> frequently used word than "dwelling." Do models ever incorporate attributes
> like token frequency? If yes, I could work from that.
> *How about the use of phonetics?*

You can define your own feature generators and combine them with the
existing feature generators.
Right now the features are binary: they are either set or not. If you
have a strength/weight, you might be able to translate it into binary
features, e.g. by using a mapping function.
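
A minimal sketch of such a mapping function, written in Python for
brevity: it cuts a continuous score into a few buckets and emits one
binary feature naming the bucket. The bucket edges, the feature name
"conc" and the toy scores are assumptions made up for illustration; the
same logic would go inside your own feature generator, wherever the
score comes from.

```python
from typing import Callable, List, Optional, Sequence


def bucket_feature(name: str, value: float, edges: Sequence[float]) -> str:
    """Map a continuous value to one binary feature naming its bucket.

    The bucket index is the number of edges the value reaches, so with
    edges (0.25, 0.5, 0.75) a value of 0.9 lands in bucket 3.
    """
    bucket = sum(1 for edge in edges if value >= edge)
    return f"{name}={bucket}"


def concreteness_features(
    tokens: Sequence[str],
    index: int,
    score_fn: Callable[[str], Optional[float]],
    edges: Sequence[float] = (0.25, 0.5, 0.75),  # assumed cut points
) -> List[str]:
    """Return the binary features for the token at `index`."""
    score = score_fn(tokens[index].lower())
    if score is None:
        # No score available: emit an explicit "unknown" feature.
        return ["conc=unknown"]
    return [bucket_feature("conc", score, edges)]


# Toy scores in [0, 1], invented for this example only.
_TOY_SCORES = {"long": 0.55, "john": 0.30, "silver": 0.90}


def toy_score(token: str) -> Optional[float]:
    """Hypothetical stand-in for a real concreteness measure."""
    return _TOY_SCORES.get(token)
```

For the tokens "Long John Silver", the feature for "Silver" comes out as
"conc=3" and the one for "John" as "conc=1". Fewer buckets give the
trainer more examples per feature; more buckets keep more of the
original weight.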
If you decide to use a dictionary, have a look at Wikipedia; maybe you
are able to link the entities to Wikipedia entries. They probably have
some properties which indicate whether an entity is fictional or not.
Wikipedia itself is hard to use for this, but projects like DBpedia
make these kinds of lookups possible.
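
A rough sketch of what such a lookup could look like, again in Python.
It only builds the SPARQL ASK query and the request URL for the public
DBpedia endpoint; sending the request and parsing the answer are left
out. It assumes that the entity name maps directly to a DBpedia resource
name (real entity linking is harder than that) and that the entity is
typed as dbo:FictionalCharacter in DBpedia, which you would need to
check for your own data. The function names are made up for this
example.

```python
from urllib.parse import quote, urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"
RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def resource_uri(entity_name: str) -> str:
    """Naive name-to-resource mapping: spaces become underscores."""
    return RESOURCE_PREFIX + quote(entity_name.strip().replace(" ", "_"))


def build_ask_query(entity_name: str) -> str:
    """Build an ASK query: is the resource typed as a fictional character?"""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        f"ASK {{ <{resource_uri(entity_name)}> a dbo:FictionalCharacter }}"
    )


def request_url(entity_name: str) -> str:
    """URL for a GET request that asks the endpoint for a JSON result."""
    params = {
        "query": build_ask_query(entity_name),
        "format": "application/sparql-results+json",
    }
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)
```

The yes/no answer could then be turned into a binary feature, or used to
build a dictionary up front instead of querying during training.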
HTH,
Jörn