Hi Rupert,

Agreed on the SettingAnnotation/ParticipantAnnotation/OccurrentAnnotation data structure.
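To make sure the structure is understood the same way on both sides, here is a minimal sketch of it in plain Python. It is not the Stanbol Java API: the class names, field names and the `urn:example:` URIs are assumptions made only for illustration, and the role values mirror the dc:type values proposed further down in this thread.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Role(Enum):
    """Participant roles proposed in the thread as dc:type values."""
    AGENT = "fise:Agent"
    PATIENT = "fise:Patient"
    INSTRUMENT = "fise:Instrument"
    CAUSE = "fise:Cause"


@dataclass
class SettingAnnotation:
    """Context resource that all other annotations of one setting point to."""
    uri: str  # hypothetical identifier


@dataclass
class ParticipantAnnotation:
    """An Endurant taking part in a setting."""
    setting: SettingAnnotation   # fise:inSetting
    mentions: List[str]          # fise:hasMention (TextAnnotation URIs)
    role: Role                   # dc:type
    suggestions: List[str] = field(default_factory=list)  # fise:suggestion


@dataclass
class OccurrentAnnotation:
    """A Perdurant (activity or event) observed in a setting."""
    setting: SettingAnnotation   # fise:inSetting
    mentions: List[str]          # fise:hasMention (typically verbs)
    dc_type: str = "fise:Activity"


# The "USA invades Irak" example from this thread, with made-up URIs.
setting = SettingAnnotation("urn:example:setting-1")
usa = ParticipantAnnotation(
    setting, ["urn:example:text-usa"], Role.AGENT, ["dbpedia:United_States"]
)
irak = ParticipantAnnotation(
    setting, ["urn:example:text-irak"], Role.PATIENT, ["dbpedia:Iraq"]
)
invades = OccurrentAnnotation(setting, ["urn:example:text-invades"])
```

All three annotations share the same `setting` object, which is the role the fise:inSetting link plays in the RDF version.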
Should I open a Jira issue to capture this information, the goals, and the initial steps towards them? How should I proceed from here? Should I write design documents for review?

Regards,
Cristian

2013/6/17 Rupert Westenthaler <rupert.westentha...@gmail.com>

> On Thu, Jun 13, 2013 at 8:22 PM, Cristian Petroaca
> <cristian.petro...@gmail.com> wrote:
> > Hi Rupert,
> >
> > First of all thanks for the detailed suggestions.
> >
> > 2013/6/12 Rupert Westenthaler <rupert.westentha...@gmail.com>
> >
> >> Hi Cristian, all
> >>
> >> really interesting use case!
> >>
> >> In this mail I will try to give some suggestions on how this could
> >> work out. These suggestions are mainly based on experiences and lessons
> >> learned in the LIVE [2] project where we built an information system
> >> for the Olympic Games in Peking. While this project excluded the
> >> extraction of Events from unstructured text (because the Olympic
> >> Information System was already providing event data as XML messages)
> >> the semantic search capabilities of this system were very similar to
> >> the ones described by your use case.
> >>
> >> IMHO you are not only trying to extract relations, but a formal
> >> representation of the situation described by the text. So let's assume
> >> that the goal is to annotate a Setting (or Situation) described in the
> >> text - a fise:SettingAnnotation.
> >>
> >> The DOLCE foundational ontology [1] gives some advice on how to model
> >> those. The important relation for modeling this is Participation:
> >>
> >> PC(x, y, t) → (ED(x) ∧ PD(y) ∧ T(t))
> >>
> >> where ..
> >>
> >> * ED are Endurants (continuants): Endurants do have an identity so we
> >> would typically refer to them as Entities referenced by a setting.
> >> Note that this includes physical, non-physical as well as
> >> social-objects.
> >> * PD are Perdurants (occurrents): Perdurants are entities that
> >> happen in time.
> >> This refers to Events, Activities ...
> >> * PC are Participations: a time-indexed relation where
> >> Endurants participate in Perdurants
> >>
> >> Modeling this in RDF requires defining some intermediate resources
> >> because RDF does not allow for n-ary relations.
> >>
> >> * fise:SettingAnnotation: It is really handy to define one resource
> >> being the context for all described data. I would call this
> >> "fise:SettingAnnotation" and define it as a sub-concept of
> >> fise:Enhancement. All further enhancements about the extracted Setting
> >> would define a "fise:in-setting" relation to it.
> >>
> >> * fise:ParticipantAnnotation: Is used to annotate that an Endurant is
> >> participating in a setting (fise:in-setting fise:SettingAnnotation).
> >> The Endurant itself is described by existing fise:TextAnnotation (the
> >> mentions) and fise:EntityAnnotation (suggested Entities). Basically
> >> the fise:ParticipantAnnotation will allow an EnhancementEngine to
> >> state that several mentions (in possibly different sentences) do
> >> represent the same Endurant as participating in the Setting. In
> >> addition it would be possible to use the dc:type property (similar as
> >> for fise:TextAnnotation) to refer to the role(s) of a participant
> >> (e.g. the set: Agent (intentionally performs an action), Cause
> >> (unintentionally, e.g. a mud slide), Patient (a passive role in an
> >> activity) and Instrument (aids a process)), but I am wondering if one
> >> could extract that information.
> >>
> >> * fise:OccurrentAnnotation: is used to annotate a Perdurant in the
> >> context of the Setting. Also fise:OccurrentAnnotation can link to
> >> fise:TextAnnotation (typically verbs in the text defining the
> >> perdurant) as well as fise:EntityAnnotation suggesting well known
> >> Events in a knowledge base (e.g. an election in a country, or an
> >> uprising ...).
> >> In addition fise:OccurrentAnnotation can define
> >> dc:has-participant links to fise:ParticipantAnnotation. In this case
> >> it is explicitly stated that an Endurant (the
> >> fise:ParticipantAnnotation) is involved in this Perdurant (the
> >> fise:OccurrentAnnotation). As Occurrences are temporally indexed this
> >> annotation should also support properties for defining the
> >> xsd:dateTime for the start/end.
> >>
> >
> > Indeed, an event based data structure makes a lot of sense, with the
> > remark that you probably won't always be able to extract the date for a
> > given setting (situation).
> > There are 2 things which are unclear though.
> >
> > 1. Perdurant: You could have situations in which the object upon which
> > the Subject (or Endurant) is acting is not a transitory object (such as
> > an event or activity) but rather another Endurant. For example we can
> > have the phrase "USA invades Irak" where "USA" is the Endurant
> > (Subject) which performs the action of "invading" on another Endurant,
> > namely "Irak".
> >
>
> By using CAOS, USA would be the Agent and Iraq the Patient. Both are
> Endurants. The activity "invading" would be the Perdurant. So ideally
> you would have a "fise:SettingAnnotation" with:
>
> * fise:ParticipantAnnotation for USA with the dc:type caos:Agent,
> linking to a fise:TextAnnotation for "USA" and a fise:EntityAnnotation
> linking to dbpedia:United_States
> * fise:ParticipantAnnotation for Iraq with the dc:type caos:Patient,
> linking to a fise:TextAnnotation for "Irak" and a
> fise:EntityAnnotation linking to dbpedia:Iraq
> * fise:OccurrentAnnotation for "invades" with the dc:type
> caos:Activity, linking to a fise:TextAnnotation for "invades"
>
> > 2. Where does the verb, which links the Subject and the Object, come
> > into this? I imagined that the Endurant would have a dc:"property"
> > where the property = verb which links to the Object in noun form. For
> > example take again the sentence "USA invades Irak".
> > You would have the "USA" Entity with dc:invader which points to the
> > Object "Irak". The Endurant would have as many dc:"property" elements
> > as there are verbs which link it to an Object.
>
> As explained above you would have a fise:OccurrentAnnotation that
> represents the Perdurant. The information that the activity mentioned in
> the text is "invades" would be expressed by linking to a
> fise:TextAnnotation. If you can also provide an Ontology for Tasks that
> defines "myTasks:invade" the fise:OccurrentAnnotation could also link to
> a fise:EntityAnnotation for this concept.
>
> best
> Rupert
>
> >
> >> ### Consuming the data:
> >>
> >> I think this model should be sufficient for use-cases as described by
> >> you.
> >>
> >> Users would be able to consume data on the setting level. This can be
> >> done by simply retrieving all fise:ParticipantAnnotation as well as
> >> fise:OccurrentAnnotation linked with a setting. BTW this was the
> >> approach used in LIVE [2] for semantic search. It allows queries for
> >> Settings that involve specific Entities e.g. you could filter for
> >> Settings that involve a {Person}, activities:Arrested and a specific
> >> {Uprising}. However note that with this approach you will also get
> >> results for Settings where the {Person} participated and another
> >> person was arrested.
> >>
> >> Another possibility would be to process enhancement results on the
> >> fise:OccurrentAnnotation. This would allow a much higher
> >> granularity level (e.g. it would allow to correctly answer the query
> >> used as an example above). But I am wondering if the quality of the
> >> Setting extraction will be sufficient for this. I also have doubts
> >> whether this can still be realized by using semantic indexing to
> >> Apache Solr or if it would be better/necessary to store results in a
> >> TripleStore and use SPARQL for retrieval.
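The difference between the two retrieval levels described above can be illustrated with a small in-memory sketch in plain Python. Nothing here is Solr, SPARQL or Stanbol code: the `Occurrent` and `Setting` classes, the `ex:` entity names and both matching functions are hypothetical, and for brevity the participants of a setting are derived from the participants of its occurrents.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Occurrent:
    activity: str                 # e.g. "activities:Arrested"
    participants: FrozenSet[str]  # entities linked via has-participant


@dataclass(frozen=True)
class Setting:
    name: str
    occurrents: Tuple[Occurrent, ...]


def match_setting_level(settings: List[Setting], entity: str, activity: str) -> List[str]:
    """Match when entity and activity both appear anywhere in the setting."""
    return [
        s.name for s in settings
        if any(entity in o.participants for o in s.occurrents)
        and any(o.activity == activity for o in s.occurrents)
    ]


def match_occurrent_level(settings: List[Setting], entity: str, activity: str) -> List[str]:
    """Match only when the entity participates in that very activity."""
    return [
        s.name for s in settings
        if any(o.activity == activity and entity in o.participants
               for o in s.occurrents)
    ]


settings = [
    # Alice is the one arrested.
    Setting("s1", (Occurrent("activities:Arrested", frozenset({"ex:Alice"})),)),
    # Alice protests, but it is Bob who gets arrested.
    Setting("s2", (
        Occurrent("activities:Protested", frozenset({"ex:Alice"})),
        Occurrent("activities:Arrested", frozenset({"ex:Bob"})),
    )),
]

coarse = match_setting_level(settings, "ex:Alice", "activities:Arrested")
fine = match_occurrent_level(settings, "ex:Alice", "activities:Arrested")
```

With this data the setting-level query returns both settings, including the one where a different person was arrested, while the occurrent-level query returns only the first.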
> >>
> >> The methodology and query language used by YAGO [3] is also very
> >> relevant for this (especially note chapter 7 SPOTL(X) Representation).
> >>
> >> Another related topic is the enrichment of Entities (especially
> >> Events) in knowledge bases based on Settings extracted from Documents.
> >> As per definition - in DOLCE - Perdurants are temporally indexed. That
> >> means that at the time when added to a knowledge base they might still
> >> be in process. So the creation, enriching and refinement of such
> >> Entities in the knowledge base seems to be critical for a System
> >> like the one described in your use-case.
> >>
> >> On Tue, Jun 11, 2013 at 9:09 PM, Cristian Petroaca
> >> <cristian.petro...@gmail.com> wrote:
> >> >
> >> > First of all I have to mention that I am new in the field of semantic
> >> > technologies, I've started to read about them in the last 4-5
> >> > months. Having said that, I have a high level overview of what is a
> >> > good approach to solve this problem. There are a number of papers on
> >> > the internet which describe what steps need to be taken, such as:
> >> > named entity recognition, co-reference resolution, POS tagging and
> >> > others.
> >>
> >> The Stanbol NLP processing module currently only supports sentence
> >> detection, tokenization, POS tagging, Chunking, NER and lemma. Support
> >> for co-reference resolution and dependency trees is currently missing.
> >>
> >> Stanford NLP is already integrated with Stanbol [4]. At the moment it
> >> only supports English, but I am already working to include the other
> >> supported languages. Other NLP frameworks that are already integrated
> >> with Stanbol are Freeling [5] and Talismane [6]. But note that for all
> >> those the integration excludes support for co-reference and dependency
> >> trees.
> >>
> >> Anyways I am confident that one can implement a first prototype by
> >> only using Sentences and POS tags and - if available - Chunks (e.g.
> >> Noun phrases).
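A first prototype along those lines, using nothing but POS tags within one sentence, could look roughly like the sketch below. It is plain Python, not Stanbol code: the coarse tag names (`NOUN`, `PROPN`, `VERB`) and the `extract_triple` helper are assumptions for illustration, and the heuristic (nearest noun before the first verb is the agent, first noun after it is the patient) will get passive sentences and anything needing co-reference wrong.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, coarse POS tag); tag names are assumed

NOUN_TAGS = {"NOUN", "PROPN"}
VERB_TAGS = {"VERB"}


def extract_triple(sentence: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return (agent, activity, patient) for one POS-tagged sentence, or None."""
    # Index of the first verb in the sentence, if any.
    verb_idx = next(
        (i for i, (_, pos) in enumerate(sentence) if pos in VERB_TAGS), None
    )
    if verb_idx is None:
        return None
    # Nouns on either side of that verb.
    before = [w for w, pos in sentence[:verb_idx] if pos in NOUN_TAGS]
    after = [w for w, pos in sentence[verb_idx + 1:] if pos in NOUN_TAGS]
    if not before or not after:
        return None
    # Closest noun before the verb, the verb itself, first noun after it.
    return (before[-1], sentence[verb_idx][0], after[0])


tagged = [("USA", "PROPN"), ("invades", "VERB"), ("Irak", "PROPN")]
triple = extract_triple(tagged)
```

For the example sentence from this thread, `triple` comes out as `("USA", "invades", "Irak")`, which maps onto one Agent participant, one occurrent and one Patient participant.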
> >>
> >
> > I assume that in the Stanbol context, a feature like Relation extraction
> > would be implemented as an EnhancementEngine?
> > What kind of effort would be required for a co-reference resolution tool
> > integration into Stanbol?
> >
>
> Yes, in the end it would be an EnhancementEngine. But before we can
> build such an engine we would need to
>
> * extend the Stanbol NLP processing API with Annotations for co-reference
> * add support for JSON Serialisation/Parsing for those annotations so
> that the RESTful NLP Analysis Service can provide co-reference
> information
>
> > At this moment I'll be focusing on 2 aspects:
> >
> > 1. Determine the best data structure to encapsulate the extracted
> > information. I'll take a closer look at Dolce.
>
> Don't make it too complex. Defining a proper structure to represent
> Events will only pay off if we can also successfully extract such
> information from processed texts.
>
> I would start with
>
> * fise:SettingAnnotation
>     * {fise:Enhancement} metadata
>
> * fise:ParticipantAnnotation
>     * {fise:Enhancement} metadata
>     * fise:inSetting {settingAnnotation}
>     * fise:hasMention {textAnnotation}
>     * fise:suggestion {entityAnnotation} (multiple if there are more
>       suggestions)
>     * dc:type one of fise:Agent, fise:Patient, fise:Instrument, fise:Cause
>
> * fise:OccurrentAnnotation
>     * {fise:Enhancement} metadata
>     * fise:inSetting {settingAnnotation}
>     * fise:hasMention {textAnnotation}
>     * dc:type set to fise:Activity
>
> If it turns out that we can extract more, we can add more structure to
> those annotations. We might also think about using an own namespace
> for those extensions to the annotation structure.
>
> > 2. Determine how should all of this be integrated into Stanbol.
>
> Just create an EventExtractionEngine and configure an enhancement chain
> that does NLP processing and EntityLinking.
>
> You should have a look at
>
> * SentimentSummarizationEngine [1] as it does a lot of things with NLP
> processing results (e.g. connecting adjectives (via verbs) to
> nouns/pronouns). So as long as we cannot use explicit dependency trees
> your code will need to do similar things with Nouns, Pronouns and
> Verbs.
>
> * Disambiguation-MLT engine, as it creates a Java representation of
> present fise:TextAnnotation and fise:EntityAnnotation [2]. Something
> similar will also be required by the EventExtractionEngine for fast
> access to such annotations while iterating over the Sentences of the
> text.
>
>
> best
> Rupert
>
> [1]
> https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/sentiment-summarization/src/main/java/org/apache/stanbol/enhancer/engines/sentiment/summarize/SentimentSummarizationEngine.java
> [2]
> https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/disambiguation-mlt/src/main/java/org/apache/stanbol/enhancer/engine/disambiguation/mlt/DisambiguationData.java
>
> >
> > Thanks
> >
> >> Hope this helps to bootstrap this discussion
> >> best
> >> Rupert
> >>
> >> --
> >> | Rupert Westenthaler rupert.westentha...@gmail.com
> >> | Bodenlehenstraße 11 ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
>
> --
> | Rupert Westenthaler rupert.westentha...@gmail.com
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen