Hi All,

Quite a while ago I started a discussion on this list about Event Extraction from text. See https://issues.apache.org/jira/browse/STANBOL-1121 .

I'd like to get started on the actual work. I have been thinking about how best to approach it, and there are some things I would do differently from what the JIRA describes. I'd like to get your feedback on them.

The main approach would be:

1. Detect all named entities (NERs) and their co-references.

2. Apply semantic role labeling to the sentences in which those entities occur. I found some interesting semantic role labeling libraries, such as https://code.google.com/p/mate-tools/ and http://cogcomp.cs.illinois.edu/page/software_view/SRL. With these I'll be able to detect the Agent, the Verb (action), the Patient and the Instruments.

This could be a minimal implementation of the engine. After that I can simply create the event data model as described in the JIRA and annotate the text.

But this does not detect what kind of event it is, or which event-specific roles the entities have in the relation. For example, take the sentence "Google buys Yahoo for $100 million". There is a lot more to be said about this sentence than simply that "Google" is the Agent and "Yahoo" is the Patient. This is an acquisition event, in which "Google" is the buyer and "Yahoo" the bought entity. We would also need to align synonymous phrases such as "buy" and "acquire" to a common ontology, so that we know both refer to the same Acquisition event.

Having said that, we would add a new step:

3. Try to detect the event type and the event details. This can be done in either of two ways:

3.1 Rule-based: hand-written rules that map a certain sentence structure (such as the verb and the types of the entities acting as agent and patient) to a certain event type. This has the benefit of being easy to build, but it is quite inflexible.

3.2 Statistical: train a model that classifies the event type based on features of the sentence such as verb type, entity type, role type, etc. This is the approach described here: http://web.stanford.edu/~jurafsky/mintz.pdf.
The statistical approach would be quite hard to build, but quite flexible.

I think this third step of detecting event types and details would be most effective for domain-specific events. We would have configs with several models for several domains, and the user could either use one of the pre-existing models or create a new one.

I don't have any practical experience with training models or with feature-based text classification (though I've been doing a lot of reading on it), so I'm not sure exactly how feasible point 3 actually is.

Regards,
Cristian
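
P.S. To make the rule-based option (3.1) a bit more concrete, here is a rough sketch in Python (the real engine would of course live inside Stanbol). Everything in it is hypothetical: the Frame class, the VERB_TO_CONCEPT and RULES tables and the classify function are names made up for illustration, not part of Stanbol or of any SRL library. It also assumes the SRL step already delivers a verb lemma ("buy" for "buys") together with the agent and patient and their entity types.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Frame:
    """Assumed output of the SRL step for one sentence (hypothetical shape)."""
    verb: str          # verb lemma, e.g. "buy"
    agent: str         # surface form of the Agent
    agent_type: str    # entity type of the Agent
    patient: str       # surface form of the Patient
    patient_type: str  # entity type of the Patient


# Align synonymous verbs to one event concept (illustrative vocabulary only).
VERB_TO_CONCEPT: Dict[str, str] = {
    "buy": "Acquisition",
    "acquire": "Acquisition",
    "purchase": "Acquisition",
}

# Hand-written rules: (event concept, agent type, patient type)
# -> event-specific role names for the agent and the patient.
RULES: Dict[Tuple[str, str, str], Tuple[str, str]] = {
    ("Acquisition", "Organization", "Organization"): ("buyer", "bought"),
}


def classify(frame: Frame) -> Optional[Dict[str, str]]:
    """Return the event type with event-specific roles, or None if no rule matches."""
    concept = VERB_TO_CONCEPT.get(frame.verb.lower())
    if concept is None:
        return None
    roles = RULES.get((concept, frame.agent_type, frame.patient_type))
    if roles is None:
        return None
    agent_role, patient_role = roles
    return {"event": concept, agent_role: frame.agent, patient_role: frame.patient}


if __name__ == "__main__":
    # "Google buys Yahoo for $100 million", after NER and SRL.
    print(classify(Frame("buy", "Google", "Organization", "Yahoo", "Organization")))
```

The two tables are exactly where the inflexibility shows up: every new verb or entity-type combination needs another hand-written entry, which is what a trained model as in 3.2 would replace.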