Event Extraction Engine

Cristian Petroaca Sun, 15 Feb 2015 03:38:02 -0800

Hi All,

Quite a while ago I started a discussion on this list about event
extraction from text. See
https://issues.apache.org/jira/browse/STANBOL-1121


I'd like to get started on the actual work. I have been thinking about how
best to approach it, and there are some things that I would do differently
from what the JIRA issue describes. I'd like to get your feedback on them.

Basically the main approach would be:

1. Detect all named entities (NER) and their co-references.

2. Apply semantic role labeling (SRL) to the sentences in which those named
entities occur.
I found some interesting SRL libraries, such as
https://code.google.com/p/mate-tools/ or
http://cogcomp.cs.illinois.edu/page/software_view/SRL.
With this I'll be able to detect the Agent, the Verb (action), the
Patient and any Instruments.

This could be a minimal implementation of the engine. After that I can
simply create the event data model as described in the JIRA and annotate
the text.
However, this does not detect what kind of event it is, or which
event-specific roles the entities play in the relation.

For example, take the sentence "Google buys Yahoo for $100 million".
There is a lot more to be said about this sentence than simply that
"Google" is the Agent and "Yahoo" is the Patient. This is actually an
acquisition event in which "Google" is the buyer and "Yahoo" the bought
entity. We would also need to align synonymous phrases such as "buy" and
"acquire" to a common ontology, so that we know both refer to the same
Acquisition event.

Having said that, we would add a new step:
3. Try to detect the event type and event details.

This can be done by either:

3.1 Rule-based: hand-written rules that map a certain sentence structure
(the verb, plus the entity types of the agent and patient) to a certain
event type.
This has the benefit of being easy to build, but it is quite inflexible.
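
To make the rule-based option more concrete, here is a rough Python sketch
of what I have in mind. Everything in it is an assumption for illustration
only: the Frame structure, the RULES table, the "ORG" entity type label and
the role names "buyer" and "acquired" are made up, and it assumes the SRL
step already gives us the verb lemma plus a typed agent and patient.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Frame:
    """Hypothetical output of the SRL step for one predicate."""
    verb: str          # lemma of the predicate, e.g. "buy"
    agent: str         # surface text of the agent
    agent_type: str    # NER type of the agent, e.g. "ORG"
    patient: str       # surface text of the patient
    patient_type: str  # NER type of the patient


# Hypothetical rule table:
# (verb lemma, agent type, patient type) -> (event type, agent role, patient role)
RULES: Dict[Tuple[str, str, str], Tuple[str, str, str]] = {
    ("buy", "ORG", "ORG"): ("Acquisition", "buyer", "acquired"),
    ("acquire", "ORG", "ORG"): ("Acquisition", "buyer", "acquired"),
}


def classify(frame: Frame) -> Optional[Dict[str, str]]:
    """Return the event described by the frame, or None if no rule matches."""
    rule = RULES.get((frame.verb, frame.agent_type, frame.patient_type))
    if rule is None:
        return None
    event_type, agent_role, patient_role = rule
    return {
        "type": event_type,
        agent_role: frame.agent,
        patient_role: frame.patient,
    }


event = classify(Frame("buy", "Google", "ORG", "Yahoo", "ORG"))
print(event)
```

Both "buy" and "acquire" map to the same Acquisition event here, which is
the kind of synonym alignment I mentioned above. The inflexibility also
shows: any verb or type combination not in the table yields nothing.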

3.2 Statistical: train a model that classifies the event type based on
features of the sentence such as verb type, entity type, role type, etc.
This is the approach described here:
http://web.stanford.edu/~jurafsky/mintz.pdf.
This would be harder to build but more flexible.
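
As a rough illustration of the statistical option, the Python sketch below
trains a tiny Naive Bayes classifier with add-one smoothing over sentence
features. Please read it as a toy under stated assumptions: Naive Bayes is
swapped in only for brevity and is not the model from the paper linked
above, the feature names and the "Hiring" event type are hypothetical, and
the four training examples are invented purely to make the code run.

```python
import math
from collections import Counter, defaultdict


def featurize(verb, agent_type, patient_type):
    """Turn one SRL frame into a list of string features (hypothetical set)."""
    return ["verb=" + verb, "agent=" + agent_type, "patient=" + patient_type]


class NaiveBayes:
    """Multinomial Naive Bayes over string features, add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        for features, label in examples:
            self.label_counts[label] += 1
            for feature in features:
                self.feature_counts[label][feature] += 1
                self.vocab.add(feature)

    def predict(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feature in features:
                score += math.log((self.feature_counts[label][feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, for illustration only.
TOY_EXAMPLES = [
    (featurize("buy", "ORG", "ORG"), "Acquisition"),
    (featurize("acquire", "ORG", "ORG"), "Acquisition"),
    (featurize("hire", "ORG", "PERSON"), "Hiring"),
    (featurize("appoint", "ORG", "PERSON"), "Hiring"),
]

model = NaiveBayes()
model.train(TOY_EXAMPLES)
# "purchase" was never seen in training; the entity types still carry signal.
print(model.predict(featurize("purchase", "ORG", "ORG")))
```

The point of the sketch is the flexibility: an unseen verb can still be
classified from the remaining features, which a fixed rule table cannot do.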

I think this third step of detecting event types and details would be most
effective for domain-specific events. We would have configs with several
models for several domains, and the user could either use one of the
existing models or create a new one.
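
One possible shape for the per-domain configuration, again only as a Python
sketch: a registry that maps a domain name to a classifier, so that a user
can pick an existing model or register a new one. The ModelRegistry class,
the "finance" domain name and the frame dictionary layout are all
hypothetical and not part of Stanbol.

```python
from typing import Callable, Dict, Optional

# A classifier takes a frame (here a plain dict) and returns an event type,
# or None when it cannot decide. This signature is an assumption.
Classifier = Callable[[dict], Optional[str]]


class ModelRegistry:
    """Holds one event classifier per domain (hypothetical design)."""

    def __init__(self) -> None:
        self._models: Dict[str, Classifier] = {}

    def register(self, domain: str, model: Classifier) -> None:
        """Add a new model, or replace the existing one for this domain."""
        self._models[domain] = model

    def domains(self):
        """List the domains that currently have a model, sorted by name."""
        return sorted(self._models)

    def classify(self, domain: str, frame: dict) -> Optional[str]:
        """Classify a frame with the model configured for the domain."""
        if domain not in self._models:
            raise KeyError("no model configured for domain: " + domain)
        return self._models[domain](frame)


def finance_model(frame: dict) -> Optional[str]:
    """Stand-in for a real model: recognises only acquisition verbs."""
    return "Acquisition" if frame.get("verb") in ("buy", "acquire") else None


registry = ModelRegistry()
registry.register("finance", finance_model)
print(registry.classify("finance", {"verb": "buy"}))
```

A rule-based and a statistical classifier could both sit behind the same
interface, so the choice between 3.1 and 3.2 could be made per domain.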

I don't have any practical experience with training models or with
feature-based text classification (though I've been doing a lot of reading
on it), so I'm not sure how feasible what I described in step 3 actually
is.

Regards,
Cristian
