On Mon, Jun 17, 2013 at 10:18 PM, Cristian Petroaca <cristian.petro...@gmail.com> wrote:
> Hi Rupert,
>
> Agreed on the SettingAnnotation/ParticipantAnnotation/OccurrentAnnotation
> data structure.
>
> Should I open up a Jira for all of this in order to encapsulate this
> information and establish the goals and these initial steps towards these
> goals?

Yes please. A JIRA issue for this work would be great.

> How should I proceed further? Should I create some design documents that
> need to be reviewed?

Usually it is best to write design-related text directly in JIRA using
Markdown [1] syntax. This will allow us to use this text later directly for
the documentation on the Stanbol webpage.

best
Rupert

[1] http://daringfireball.net/projects/markdown/

>
> Regards,
> Cristian
>
>
> 2013/6/17 Rupert Westenthaler <rupert.westentha...@gmail.com>
>
>> On Thu, Jun 13, 2013 at 8:22 PM, Cristian Petroaca
>> <cristian.petro...@gmail.com> wrote:
>> > Hi Rupert,
>> >
>> > First of all thanks for the detailed suggestions.
>> >
>> > 2013/6/12 Rupert Westenthaler <rupert.westentha...@gmail.com>
>> >
>> >> Hi Cristian, all
>> >>
>> >> really interesting use case!
>> >>
>> >> In this mail I will try to give some suggestions on how this could
>> >> work out. These suggestions are mainly based on experiences and lessons
>> >> learned in the LIVE [2] project where we built an information system
>> >> for the Olympic Games in Peking. While this project excluded the
>> >> extraction of Events from unstructured text (because the Olympic
>> >> Information System was already providing event data as XML messages)
>> >> the semantic search capabilities of this system were very similar to
>> >> the ones described by your use case.
>> >>
>> >> IMHO you are not only trying to extract relations, but a formal
>> >> representation of the situation described by the text. So let's assume
>> >> that the goal is to annotate a Setting (or Situation) described in the
>> >> text - a fise:SettingAnnotation.
>> >>
>> >> The DOLCE foundational ontology [1] gives some advice on how to model
>> >> those. The important relation for modeling this is Participation:
>> >>
>> >> PC(x, y, t) → (ED(x) ∧ PD(y) ∧ T(t))
>> >>
>> >> where ..
>> >>
>> >> * ED are Endurants (continuants): Endurants do have an identity so we
>> >> would typically refer to them as Entities referenced by a setting.
>> >> Note that this includes physical, non-physical as well as
>> >> social objects.
>> >> * PD are Perdurants (occurrents): Perdurants are entities that
>> >> happen in time. This refers to Events, Activities ...
>> >> * PC is Participation: It is a time-indexed relation where
>> >> Endurants participate in Perdurants
>> >>
>> >> Modeling this in RDF requires defining some intermediate resources
>> >> because RDF does not allow for n-ary relations.
>> >>
>> >> * fise:SettingAnnotation: It is really handy to define one resource
>> >> being the context for all described data. I would call this
>> >> "fise:SettingAnnotation" and define it as a sub-concept to
>> >> fise:Enhancement. All further enhancements about the extracted Setting
>> >> would define a "fise:in-setting" relation to it.
>> >>
>> >> * fise:ParticipantAnnotation: Is used to annotate that an Endurant is
>> >> participating in a setting (fise:in-setting fise:SettingAnnotation).
>> >> The Endurant itself is described by existing fise:TextAnnotation (the
>> >> mentions) and fise:EntityAnnotation (suggested Entities). Basically
>> >> the fise:ParticipantAnnotation will allow an EnhancementEngine to
>> >> state that several mentions (in possibly different sentences)
>> >> represent the same Endurant as participating in the Setting. In
>> >> addition it would be possible to use the dc:type property (similar as
>> >> for fise:TextAnnotation) to refer to the role(s) of a participant
>> >> (e.g. the set: Agent (intentionally performs an action), Cause
>> >> (unintentionally, e.g. a mud slide), Patient (a passive role in an
>> >> activity) and Instrument (aids a process)), but I am wondering if one
>> >> could extract this information.
>> >>
>> >> * fise:OccurrentAnnotation: is used to annotate a Perdurant in the
>> >> context of the Setting.
>> >> Also fise:OccurrentAnnotation can link to
>> >> fise:TextAnnotation (typically verbs in the text defining the
>> >> perdurant) as well as fise:EntityAnnotation suggesting well known
>> >> Events in a knowledge base (e.g. an election in a country, or an
>> >> uprising ...). In addition fise:OccurrentAnnotation can define
>> >> dc:has-participant links to fise:ParticipantAnnotation. In this case
>> >> it is explicitly stated that an Endurant (the
>> >> fise:ParticipantAnnotation) is involved in this Perdurant (the
>> >> fise:OccurrentAnnotation). As Occurrences are temporally indexed this
>> >> annotation should also support properties for defining the
>> >> xsd:dateTime for the start/end.
>> >>
>> >>
>> > Indeed, an event based data structure makes a lot of sense with the remark
>> > that you probably won't be able to always extract the date for a given
>> > setting (situation).
>> > There are 2 things which are unclear though.
>> >
>> > 1. Perdurant: You could have situations in which the object upon which the
>> > Subject (or Endurant) is acting is not a transitory object (such as an
>> > event, activity) but rather another Endurant. For example we can have the
>> > phrase "USA invades Irak" where "USA" is the Endurant (Subject) which
>> > performs the action of "invading" on another Endurant, namely "Irak".
>> >
>>
>> By using CAOS, USA would be the Agent and Iraq the Patient. Both are
>> Endurants. The activity "invading" would be the Perdurant.
>> So ideally you would have a "fise:SettingAnnotation" with:
>>
>> * fise:ParticipantAnnotation for USA with the dc:type caos:Agent,
>> linking to a fise:TextAnnotation for "USA" and a fise:EntityAnnotation
>> linking to dbpedia:United_States
>> * fise:ParticipantAnnotation for Iraq with the dc:type caos:Patient,
>> linking to a fise:TextAnnotation for "Irak" and a
>> fise:EntityAnnotation linking to dbpedia:Iraq
>> * fise:OccurrentAnnotation for "invades" with the dc:type
>> caos:Activity, linking to a fise:TextAnnotation for "invades"
>>
>> > 2. Where does the verb, which links the Subject and the Object come into
>> > this? I imagined that the Endurant would have a dc:"property" where the
>> > property = verb which links to the Object in noun form. For example take
>> > again the sentence "USA invades Irak". You would have the "USA" Entity with
>> > dc:invader which points to the Object "Irak". The Endurant would have as
>> > many dc:"property" elements as there are verbs which link it to an Object.
>>
>> As explained above you would have a fise:OccurrentAnnotation that
>> represents the Perdurant. The information that the activity mentioned in
>> the text is "invades" would be given by linking to a fise:TextAnnotation.
>> If you can also provide an Ontology for Tasks that defines
>> "myTasks:invade" the fise:OccurrentAnnotation could also link to a
>> fise:EntityAnnotation for this concept.
>>
>> best
>> Rupert
>>
>> >
>> > ### Consuming the data:
>> >>
>> >> I think this model should be sufficient for use-cases as described by you.
>> >>
>> >> Users would be able to consume data on the setting level. This can be
>> >> done by simply retrieving all fise:ParticipantAnnotation as well as
>> >> fise:OccurrentAnnotation linked with a setting. BTW this was the
>> >> approach used in LIVE [2] for semantic search. It allows queries for
>> >> Settings that involve specific Entities e.g.
>> >> you could filter for
>> >> Settings that involve a {Person}, activities:Arrested and a specific
>> >> {Uprising}. However note that with this approach you will also get
>> >> results for Settings where the {Person} participated and another person
>> >> was arrested.
>> >>
>> >> Another possibility would be to process enhancement results on the
>> >> fise:OccurrentAnnotation. This would allow for a much higher
>> >> granularity level (e.g. it would allow correctly answering the query
>> >> used as an example above). But I am wondering if the quality of the
>> >> Setting extraction will be sufficient for this. I also have doubts whether
>> >> this can still be realized by using semantic indexing to Apache Solr
>> >> or if it would be better/necessary to store results in a TripleStore
>> >> and use SPARQL for retrieval.
>> >>
>> >> The methodology and query language used by YAGO [3] is also very
>> >> relevant for this (especially note chapter 7 SPOTL(X) Representation).
>> >>
>> >> Another related topic is the enrichment of Entities (especially
>> >> Events) in knowledge bases based on Settings extracted from Documents.
>> >> As per definition - in DOLCE - Perdurants are temporally indexed. That
>> >> means that at the time when added to a knowledge base they might still
>> >> be in process. So the creation, enriching and refinement of such
>> >> Entities in the knowledge base seems to be critical for a system
>> >> like the one described in your use-case.
>> >>
>> >> On Tue, Jun 11, 2013 at 9:09 PM, Cristian Petroaca
>> >> <cristian.petro...@gmail.com> wrote:
>> >> >
>> >> > First of all I have to mention that I am new in the field of semantic
>> >> > technologies, I've started to read about them in the last 4-5 months.
>> >> > Having said that I have a high level overview of what is a good
>> >> > approach to solve this problem.
>> >> > There are a number of papers on the internet which describe
>> >> > what steps need to be taken such as: named entity recognition,
>> >> > co-reference resolution, POS tagging and others.
>> >>
>> >> The Stanbol NLP processing module currently only supports sentence
>> >> detection, tokenization, POS tagging, Chunking, NER and lemma. Support
>> >> for co-reference resolution and dependency trees is currently missing.
>> >>
>> >> Stanford NLP is already integrated with Stanbol [4]. At the moment it
>> >> only supports English, but I am already working to include the other
>> >> supported languages. Other NLP frameworks that are already integrated
>> >> with Stanbol are Freeling [5] and Talismane [6]. But note that for all
>> >> those the integration excludes support for co-reference and dependency
>> >> trees.
>> >>
>> >> Anyways I am confident that one can implement a first prototype by
>> >> only using Sentences and POS tags and - if available - Chunks (e.g.
>> >> Noun phrases).
>> >>
>> >>
>> > I assume that in the Stanbol context, a feature like Relation extraction
>> > would be implemented as an EnhancementEngine?
>> > What kind of effort would be required for a co-reference resolution tool
>> > integration into Stanbol?
>> >
>>
>> Yes in the end it would be an EnhancementEngine. But before we can
>> build such an engine we would need to
>>
>> * extend the Stanbol NLP processing API with Annotations for co-reference
>> * add support for JSON Serialisation/Parsing for those annotations so
>> that the RESTful NLP Analysis Service can provide co-reference
>> information
>>
>> > At this moment I'll be focusing on 2 aspects:
>> >
>> > 1. Determine the best data structure to encapsulate the extracted
>> > information. I'll take a closer look at Dolce.
>>
>> Don't make it too complex. Defining a proper structure to represent
>> Events will only pay off if we can also successfully extract such
>> information from processed texts.
>>
>> I would start with
>>
>> * fise:SettingAnnotation
>>     * {fise:Enhancement} metadata
>>
>> * fise:ParticipantAnnotation
>>     * {fise:Enhancement} metadata
>>     * fise:inSetting {settingAnnotation}
>>     * fise:hasMention {textAnnotation}
>>     * fise:suggestion {entityAnnotation} (multiple if there are more
>>       suggestions)
>>     * dc:type one of fise:Agent, fise:Patient, fise:Instrument, fise:Cause
>>
>> * fise:OccurrentAnnotation
>>     * {fise:Enhancement} metadata
>>     * fise:inSetting {settingAnnotation}
>>     * fise:hasMention {textAnnotation}
>>     * dc:type set to fise:Activity
>>
>> If it turns out that we can extract more, we can add more structure to
>> those annotations. We might also think about using an own namespace
>> for those extensions to the annotation structure.
>>
>> > 2. Determine how should all of this be integrated into Stanbol.
>>
>> Just create an EventExtractionEngine and configure an enhancement chain
>> that does NLP processing and EntityLinking.
>>
>> You should have a look at
>>
>> * SentimentSummarizationEngine [1] as it does a lot of things with NLP
>> processing results (e.g. connecting adjectives (via verbs) to
>> nouns/pronouns). So as long as we cannot use explicit dependency trees
>> your code will need to do similar things with Nouns, Pronouns and
>> Verbs.
>>
>> * Disambiguation-MLT engine, as it creates a Java representation of
>> present fise:TextAnnotation and fise:EntityAnnotation [2]. Something
>> similar will also be required by the EventExtractionEngine for fast
>> access to such annotations while iterating over the Sentences of the
>> text.
>>
>>
>> best
>> Rupert
>>
>> [1] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/sentiment-summarization/src/main/java/org/apache/stanbol/enhancer/engines/sentiment/summarize/SentimentSummarizationEngine.java
>> [2] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/disambiguation-mlt/src/main/java/org/apache/stanbol/enhancer/engine/disambiguation/mlt/DisambiguationData.java
>>
>> >
>> > Thanks
>> >
>> > Hope this helps to bootstrap this discussion
>> >> best
>> >> Rupert
>> >>
>> >> --
>> >> | Rupert Westenthaler rupert.westentha...@gmail.com
>> >> | Bodenlehenstraße 11 ++43-699-11108907
>> >> | A-5500 Bischofshofen