Hi Cristian,

I created the branch at
http://svn.apache.org/repos/asf/stanbol/branches/nlp-dep-tree-and-co-ref/

ATM it contains only the "nlp" and "nlp-json" modules. Let me know if you
would like to have more.

best
Rupert

On Thu, Jul 4, 2013 at 10:14 AM, Cristian Petroaca
<cristian.petro...@gmail.com> wrote:
> Hi Rupert,
>
> I created jiras: https://issues.apache.org/jira/browse/STANBOL-1132 and
> https://issues.apache.org/jira/browse/STANBOL-1133. The original one is
> dependent upon these.
> Please let me know when I can start using the branch.
>
> Thanks,
> Cristian
>
>
> 2013/6/27 Cristian Petroaca <cristian.petro...@gmail.com>
>
>> 2013/6/27 Rupert Westenthaler <rupert.westentha...@gmail.com>
>>
>>> On Thu, Jun 27, 2013 at 3:12 PM, Cristian Petroaca
>>> <cristian.petro...@gmail.com> wrote:
>>> > Sorry, I meant the Stanbol NLP API, not Stanford in my previous e-mail.
>>> > By the way, does OpenNLP have the ability to build dependency trees?
>>> >
>>>
>>> AFAIK OpenNLP does not provide this feature.
>>>
>>
>> Then, since the Stanford NLP lib is also integrated into Stanbol, I'll
>> take a look at how I can extend its integration to include the dependency
>> tree feature.
>>
>>> >
>>> > 2013/6/23 Cristian Petroaca <cristian.petro...@gmail.com>
>>> >
>>> >> Hi Rupert,
>>> >>
>>> >> I created jira https://issues.apache.org/jira/browse/STANBOL-1121.
>>> >> As you suggested I would start with extending the Stanford NLP with
>>> >> co-reference resolution but I think also with dependency trees because I
>>> >> also need to know the Subject of the sentence and the object that it
>>> >> affects, right?
>>> >>
>>> >> Given that I need to extend the Stanford NLP API in Stanbol for
>>> >> co-reference and dependency trees, how do I proceed with this? Do I create
>>> >> 2 new sub-tasks to the already opened Jira? After that can I start
>>> >> implementing on my local copy of Stanbol and when I'm done I'll send you
>>> >> guys the patch for review?
>>> >>
>>>
>>> I would create two "New Feature" type issues, one for adding support
>>> for "dependency trees" and the other for "co-reference" support. You
>>> should also define "depends on" relations between STANBOL-1121 and
>>> those two new issues.
>>>
>>> Sub-tasks could also work, but as adding those features would also be
>>> interesting for other things I would rather define them as separate
>>> issues.
>>>
>>
>> 2 New Features connected with the original jira it is then.
>>
>>> If you would prefer to work in your own branch please tell me. This
>>> could have the advantage that patches would not be affected by changes
>>> in the trunk.
>>>
>>
>> Yes, a separate branch sounds good.
>>
>>> best
>>> Rupert
>>>
>>> >> Regards,
>>> >> Cristian
>>> >>
>>> >>
>>> >> 2013/6/18 Rupert Westenthaler <rupert.westentha...@gmail.com>
>>> >>
>>> >>> On Mon, Jun 17, 2013 at 10:18 PM, Cristian Petroaca
>>> >>> <cristian.petro...@gmail.com> wrote:
>>> >>> > Hi Rupert,
>>> >>> >
>>> >>> > Agreed on the SettingAnnotation/ParticipantAnnotation/OccurrentAnnotation
>>> >>> > data structure.
>>> >>> >
>>> >>> > Should I open up a Jira for all of this in order to encapsulate this
>>> >>> > information and establish the goals and these initial steps towards these
>>> >>> > goals?
>>> >>>
>>> >>> Yes please. A JIRA issue for this work would be great.
>>> >>>
>>> >>> > How should I proceed further? Should I create some design documents that
>>> >>> > need to be reviewed?
>>> >>>
>>> >>> Usually it is best to write design-related text directly in JIRA
>>> >>> by using Markdown [1] syntax. This will allow us later to use this
>>> >>> text directly for the documentation on the Stanbol webpage.
>>> >>>
>>> >>> best
>>> >>> Rupert
>>> >>>
>>> >>> [1] http://daringfireball.net/projects/markdown/
>>> >>>
>>> >>> >
>>> >>> > Regards,
>>> >>> > Cristian
>>> >>> >
>>> >>> >
>>> >>> > 2013/6/17 Rupert Westenthaler <rupert.westentha...@gmail.com>
>>> >>> >
>>> >>> >> On Thu, Jun 13, 2013 at 8:22 PM, Cristian Petroaca
>>> >>> >> <cristian.petro...@gmail.com> wrote:
>>> >>> >> > Hi Rupert,
>>> >>> >> >
>>> >>> >> > First of all thanks for the detailed suggestions.
>>> >>> >> >
>>> >>> >> > 2013/6/12 Rupert Westenthaler <rupert.westentha...@gmail.com>
>>> >>> >> >
>>> >>> >> >> Hi Cristian, all
>>> >>> >> >>
>>> >>> >> >> really interesting use case!
>>> >>> >> >>
>>> >>> >> >> In this mail I will try to give some suggestions on how this could
>>> >>> >> >> work out. These suggestions are mainly based on experiences and lessons
>>> >>> >> >> learned in the LIVE [2] project where we built an information system
>>> >>> >> >> for the Olympic Games in Peking. While this project excluded the
>>> >>> >> >> extraction of Events from unstructured text (because the Olympic
>>> >>> >> >> Information System was already providing event data as XML messages)
>>> >>> >> >> the semantic search capabilities of this system were very similar to
>>> >>> >> >> the ones described by your use case.
>>> >>> >> >>
>>> >>> >> >> IMHO you are not only trying to extract relations, but a formal
>>> >>> >> >> representation of the situation described by the text. So let's assume
>>> >>> >> >> that the goal is to annotate a Setting (or Situation) described in the
>>> >>> >> >> text - a fise:SettingAnnotation.
>>> >>> >> >>
>>> >>> >> >> The DOLCE foundational ontology [1] gives some advice on how to model
>>> >>> >> >> those. The important relation for modeling this is Participation:
>>> >>> >> >>
>>> >>> >> >>     PC(x, y, t) → (ED(x) ∧ PD(y) ∧ T(t))
>>> >>> >> >>
>>> >>> >> >> where ..
>>> >>> >> >>
>>> >>> >> >> * ED are Endurants (continuants): Endurants do have an identity so we
>>> >>> >> >> would typically refer to them as Entities referenced by a setting.
>>> >>> >> >> Note that this includes physical, non-physical as well as
>>> >>> >> >> social objects.
>>> >>> >> >> * PD are Perdurants (occurrents): Perdurants are entities that
>>> >>> >> >> happen in time. This refers to Events, Activities ...
>>> >>> >> >> * PC is Participation: It is a time-indexed relation where
>>> >>> >> >> Endurants participate in Perdurants
>>> >>> >> >>
>>> >>> >> >> Modeling this in RDF requires defining some intermediate resources
>>> >>> >> >> because RDF does not allow for n-ary relations.
>>> >>> >> >>
>>> >>> >> >> * fise:SettingAnnotation: It is really handy to define one resource
>>> >>> >> >> being the context for all described data. I would call this
>>> >>> >> >> "fise:SettingAnnotation" and define it as a sub-concept of
>>> >>> >> >> fise:Enhancement. All further enhancements about the extracted Setting
>>> >>> >> >> would define a "fise:in-setting" relation to it.
>>> >>> >> >>
>>> >>> >> >> * fise:ParticipantAnnotation: Is used to annotate that an Endurant is
>>> >>> >> >> participating in a setting (fise:in-setting fise:SettingAnnotation).
>>> >>> >> >> The Endurant itself is described by existing fise:TextAnnotation (the
>>> >>> >> >> mentions) and fise:EntityAnnotation (suggested Entities). Basically
>>> >>> >> >> the fise:ParticipantAnnotation will allow an EnhancementEngine to
>>> >>> >> >> state that several mentions (in possibly different sentences) do
>>> >>> >> >> represent the same Endurant as participating in the Setting. In
>>> >>> >> >> addition it would be possible to use the dc:type property (similar to
>>> >>> >> >> fise:TextAnnotation) to refer to the role(s) of a participant
>>> >>> >> >> (e.g. the set: Agent (intentionally performs an action), Cause
>>> >>> >> >> (unintentionally, e.g. a mud slide), Patient (a passive role in an
>>> >>> >> >> activity) and Instrument (aids a process)), but I am wondering if one
>>> >>> >> >> could extract that information.
>>> >>> >> >>
>>> >>> >> >> * fise:OccurrentAnnotation: is used to annotate a Perdurant in the
>>> >>> >> >> context of the Setting. Also fise:OccurrentAnnotation can link to
>>> >>> >> >> fise:TextAnnotation (typically verbs in the text defining the
>>> >>> >> >> perdurant) as well as fise:EntityAnnotation suggesting well-known
>>> >>> >> >> Events in a knowledge base (e.g. an election in a country, or an
>>> >>> >> >> uprising ...). In addition fise:OccurrentAnnotation can define
>>> >>> >> >> dc:has-participant links to fise:ParticipantAnnotation. In this case
>>> >>> >> >> it is explicitly stated that an Endurant (the
>>> >>> >> >> fise:ParticipantAnnotation) is involved in this Perdurant (the
>>> >>> >> >> fise:OccurrentAnnotation). As Occurrences are temporally indexed this
>>> >>> >> >> annotation should also support properties for defining the
>>> >>> >> >> xsd:dateTime for the start/end.
>>> >>> >> >>
>>> >>> >> >
>>> >>> >> > Indeed, an event based data structure makes a lot of sense with the remark
>>> >>> >> > that you probably won't be able to always extract the date for a given
>>> >>> >> > setting (situation).
>>> >>> >> > There are 2 things which are unclear though.
>>> >>> >> >
>>> >>> >> > 1. Perdurant: You could have situations in which the object upon which the
>>> >>> >> > Subject (or Endurant) is acting is not a transitory object (such as an
>>> >>> >> > event, activity) but rather another Endurant.
>>> >>> >> > For example we can have the
>>> >>> >> > phrase "USA invades Irak" where "USA" is the Endurant (Subject) which
>>> >>> >> > performs the action of "invading" on another Endurant, namely "Irak".
>>> >>> >> >
>>> >>> >>
>>> >>> >> By using CAOS, USA would be the Agent and Iraq the Patient. Both are
>>> >>> >> Endurants. The activity "invading" would be the Perdurant. So ideally
>>> >>> >> you would have a "fise:SettingAnnotation" with:
>>> >>> >>
>>> >>> >> * fise:ParticipantAnnotation for USA with the dc:type caos:Agent,
>>> >>> >> linking to a fise:TextAnnotation for "USA" and a fise:EntityAnnotation
>>> >>> >> linking to dbpedia:United_States
>>> >>> >> * fise:ParticipantAnnotation for Iraq with the dc:type caos:Patient,
>>> >>> >> linking to a fise:TextAnnotation for "Irak" and a
>>> >>> >> fise:EntityAnnotation linking to dbpedia:Iraq
>>> >>> >> * fise:OccurrentAnnotation for "invades" with the dc:type
>>> >>> >> caos:Activity, linking to a fise:TextAnnotation for "invades"
>>> >>> >>
>>> >>> >> > 2. Where does the verb, which links the Subject and the Object, come into
>>> >>> >> > this? I imagined that the Endurant would have a dc:"property" where the
>>> >>> >> > property = verb which links to the Object in noun form. For example take
>>> >>> >> > again the sentence "USA invades Irak". You would have the "USA" Entity with
>>> >>> >> > dc:invader which points to the Object "Irak". The Endurant would have as
>>> >>> >> > many dc:"property" elements as there are verbs which link it to an Object.
>>> >>> >>
>>> >>> >> As explained above you would have a fise:OccurrentAnnotation that
>>> >>> >> represents the Perdurant. The information that the activity mentioned in
>>> >>> >> the text is "invades" would be expressed by linking to a fise:TextAnnotation.
>>> >>> >> If you can also provide an Ontology for Tasks that defines
>>> >>> >> "myTasks:invade" the fise:OccurrentAnnotation could also link to a
>>> >>> >> fise:EntityAnnotation for this concept.
>>> >>> >>
>>> >>> >> best
>>> >>> >> Rupert
>>> >>> >>
>>> >>> >> >
>>> >>> >> >> ### Consuming the data:
>>> >>> >> >>
>>> >>> >> >> I think this model should be sufficient for use-cases as described by you.
>>> >>> >> >>
>>> >>> >> >> Users would be able to consume data on the setting level. This can be
>>> >>> >> >> done by simply retrieving all fise:ParticipantAnnotation as well as
>>> >>> >> >> fise:OccurrentAnnotation linked with a setting. BTW this was the
>>> >>> >> >> approach used in LIVE [2] for semantic search. It allows queries for
>>> >>> >> >> Settings that involve specific Entities e.g. you could filter for
>>> >>> >> >> Settings that involve a {Person}, activities:Arrested and a specific
>>> >>> >> >> {Uprising}. However note that with this approach you will also get results
>>> >>> >> >> for Settings where the {Person} participated and another person was
>>> >>> >> >> arrested.
>>> >>> >> >>
>>> >>> >> >> Another possibility would be to process enhancement results on the
>>> >>> >> >> fise:OccurrentAnnotation. This would allow a much higher
>>> >>> >> >> granularity level (e.g. it would allow to correctly answer the query
>>> >>> >> >> used as an example above). But I am wondering if the quality of the
>>> >>> >> >> Setting extraction will be sufficient for this. I also have doubts whether
>>> >>> >> >> this can still be realized by using semantic indexing to Apache Solr
>>> >>> >> >> or if it would be better/necessary to store results in a TripleStore
>>> >>> >> >> and use SPARQL for retrieval.
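The difference between the two consumption styles described above (setting level versus fise:OccurrentAnnotation level) can be illustrated with a small in-memory sketch. This is only an illustration under stated assumptions: the classes `Participant`, `Occurrent` and `Setting`, the two query functions, and the `ex:` entities are hypothetical and are not part of Stanbol, Solr or any SPARQL API.

```python
from dataclasses import dataclass, field


# Hypothetical in-memory model; none of these names are Stanbol API.
@dataclass(frozen=True)
class Participant:
    entity: str   # e.g. "ex:Alice"
    role: str     # e.g. "Agent" or "Patient"


@dataclass
class Occurrent:
    activity: str                                  # e.g. "activities:Arrested"
    participants: list = field(default_factory=list)


@dataclass
class Setting:
    name: str
    occurrents: list = field(default_factory=list)

    def entities(self):
        """All entities that participate anywhere in this setting."""
        return {p.entity for o in self.occurrents for p in o.participants}

    def activities(self):
        """All activities mentioned anywhere in this setting."""
        return {o.activity for o in self.occurrents}


def setting_level(settings, entity, activity):
    """Coarse query: entity and activity merely co-occur in a setting."""
    return [s.name for s in settings
            if entity in s.entities() and activity in s.activities()]


def occurrent_level(settings, entity, activity):
    """Fine query: the entity participates in the matching occurrent itself."""
    return [s.name for s in settings
            if any(o.activity == activity
                   and any(p.entity == entity for p in o.participants)
                   for o in s.occurrents)]


alice, bob = "ex:Alice", "ex:Bob"
settings = [
    # s1: Alice is the one who was arrested.
    Setting("s1", [Occurrent("activities:Arrested", [Participant(alice, "Patient")])]),
    # s2: Bob was arrested, Alice only took part in a different activity.
    Setting("s2", [Occurrent("activities:Arrested", [Participant(bob, "Patient")]),
                   Occurrent("activities:Protested", [Participant(alice, "Agent")])]),
]

print(setting_level(settings, alice, "activities:Arrested"))    # ['s1', 's2']
print(occurrent_level(settings, alice, "activities:Arrested"))  # ['s1']
```

The setting-level query returns `s2` even though a different person was arrested there, which is the false positive mentioned in the thread; the occurrent-level query avoids it at the cost of relying on correct participant links.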
>>> >>> >> >>
>>> >>> >> >> The methodology and query language used by YAGO [3] is also very
>>> >>> >> >> relevant for this (especially note chapter 7 SPOTL(X) Representation).
>>> >>> >> >>
>>> >>> >> >> Another related topic is the enrichment of Entities (especially
>>> >>> >> >> Events) in knowledge bases based on Settings extracted from Documents.
>>> >>> >> >> As per definition - in DOLCE - Perdurants are temporally indexed. That
>>> >>> >> >> means that at the time when added to a knowledge base they might still
>>> >>> >> >> be in process. So the creation, enriching and refinement of such
>>> >>> >> >> Entities in the knowledge base seems to be critical for a System
>>> >>> >> >> like the one described in your use-case.
>>> >>> >> >>
>>> >>> >> >> On Tue, Jun 11, 2013 at 9:09 PM, Cristian Petroaca
>>> >>> >> >> <cristian.petro...@gmail.com> wrote:
>>> >>> >> >> >
>>> >>> >> >> > First of all I have to mention that I am new in the field of semantic
>>> >>> >> >> > technologies, I've started to read about them in the last 4-5 months. Having
>>> >>> >> >> > said that I have a high level overview of what is a good approach to solve
>>> >>> >> >> > this problem. There are a number of papers on the internet which describe
>>> >>> >> >> > what steps need to be taken such as: named entity recognition,
>>> >>> >> >> > co-reference resolution, POS tagging and others.
>>> >>> >> >>
>>> >>> >> >> The Stanbol NLP processing module currently only supports sentence
>>> >>> >> >> detection, tokenization, POS tagging, Chunking, NER and lemma. Support
>>> >>> >> >> for co-reference resolution and dependency trees is currently missing.
>>> >>> >> >>
>>> >>> >> >> Stanford NLP is already integrated with Stanbol [4]. At the moment it
>>> >>> >> >> only supports English, but I am already working to include the other
>>> >>> >> >> supported languages.
>>> >>> >> >> Other NLP frameworks that are already integrated
>>> >>> >> >> with Stanbol are Freeling [5] and Talismane [6]. But note that for all
>>> >>> >> >> those the integration excludes support for co-reference and dependency
>>> >>> >> >> trees.
>>> >>> >> >>
>>> >>> >> >> Anyway, I am confident that one can implement a first prototype by
>>> >>> >> >> only using Sentences and POS tags and - if available - Chunks (e.g.
>>> >>> >> >> Noun phrases).
>>> >>> >> >>
>>> >>> >> >
>>> >>> >> > I assume that in the Stanbol context, a feature like Relation extraction
>>> >>> >> > would be implemented as an EnhancementEngine?
>>> >>> >> > What kind of effort would be required for a co-reference resolution tool
>>> >>> >> > integration into Stanbol?
>>> >>> >> >
>>> >>> >>
>>> >>> >> Yes, in the end it would be an EnhancementEngine. But before we can
>>> >>> >> build such an engine we would need to
>>> >>> >>
>>> >>> >> * extend the Stanbol NLP processing API with Annotations for co-reference
>>> >>> >> * add support for JSON Serialisation/Parsing for those annotations so
>>> >>> >> that the RESTful NLP Analysis Service can provide co-reference
>>> >>> >> information
>>> >>> >>
>>> >>> >> > At this moment I'll be focusing on 2 aspects:
>>> >>> >> >
>>> >>> >> > 1. Determine the best data structure to encapsulate the extracted
>>> >>> >> > information. I'll take a closer look at DOLCE.
>>> >>> >>
>>> >>> >> Don't make it too complex. Defining a proper structure to represent
>>> >>> >> Events will only pay off if we can also successfully extract such
>>> >>> >> information from processed texts.
>>> >>> >>
>>> >>> >> I would start with
>>> >>> >>
>>> >>> >> * fise:SettingAnnotation
>>> >>> >>     * {fise:Enhancement} metadata
>>> >>> >>
>>> >>> >> * fise:ParticipantAnnotation
>>> >>> >>     * {fise:Enhancement} metadata
>>> >>> >>     * fise:inSetting {settingAnnotation}
>>> >>> >>     * fise:hasMention {textAnnotation}
>>> >>> >>     * fise:suggestion {entityAnnotation} (multiple if there are more
>>> >>> >>       suggestions)
>>> >>> >>     * dc:type one of fise:Agent, fise:Patient, fise:Instrument, fise:Cause
>>> >>> >>
>>> >>> >> * fise:OccurrentAnnotation
>>> >>> >>     * {fise:Enhancement} metadata
>>> >>> >>     * fise:inSetting {settingAnnotation}
>>> >>> >>     * fise:hasMention {textAnnotation}
>>> >>> >>     * dc:type set to fise:Activity
>>> >>> >>
>>> >>> >> If it turns out that we can extract more, we can add more structure to
>>> >>> >> those annotations. We might also think about using an own namespace
>>> >>> >> for those extensions to the annotation structure.
>>> >>> >>
>>> >>> >> > 2. Determine how all of this should be integrated into Stanbol.
>>> >>> >>
>>> >>> >> Just create an EventExtractionEngine and configure an enhancement chain
>>> >>> >> that does NLP processing and EntityLinking.
>>> >>> >>
>>> >>> >> You should have a look at
>>> >>> >>
>>> >>> >> * SentimentSummarizationEngine [1] as it does a lot of things with NLP
>>> >>> >> processing results (e.g. connecting adjectives (via verbs) to
>>> >>> >> nouns/pronouns). So as long as we cannot use explicit dependency trees
>>> >>> >> your code will need to do similar things with Nouns, Pronouns and
>>> >>> >> Verbs.
>>> >>> >>
>>> >>> >> * Disambiguation-MLT engine, as it creates a Java representation of
>>> >>> >> present fise:TextAnnotation and fise:EntityAnnotation [2]. Something
>>> >>> >> similar will also be required by the EventExtractionEngine for fast
>>> >>> >> access to such annotations while iterating over the Sentences of the
>>> >>> >> text.
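As a rough illustration of the starting structure and of a prototype that uses only POS tags, the sketch below turns one tagged sentence into triples that use the property names listed above. Everything else is an assumption made for the example: the function `extract_setting`, the `urn:` identifiers, the Penn Treebank tag set, the use of the surface form as a stand-in for a real fise:TextAnnotation resource, and the naive "nearest noun before/after the first verb" rule. It is not how Stanbol or any existing engine does it.

```python
import itertools

# Assumption: Penn Treebank style tags for nouns and pronouns.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP"}
VERB_PREFIX = "VB"


def extract_setting(tagged, setting_id="urn:setting:1"):
    """Turn one POS-tagged sentence into (subject, predicate, object) triples.

    Naive rule (an assumption, not a Stanbol algorithm): the first verb is the
    Occurrent, the nearest noun before it is the Agent and the nearest noun
    after it is the Patient. Returns [] when the pattern is not present.
    """
    verb_idx = next((i for i, (_, tag) in enumerate(tagged)
                     if tag.startswith(VERB_PREFIX)), None)
    if verb_idx is None:
        return []
    before = [w for w, tag in tagged[:verb_idx] if tag in NOUN_TAGS]
    after = [w for w, tag in tagged[verb_idx + 1:] if tag in NOUN_TAGS]
    if not before or not after:
        return []

    ids = (f"urn:annotation:{n}" for n in itertools.count(1))
    triples = [(setting_id, "rdf:type", "fise:SettingAnnotation")]

    def annotate(rdf_type, mention, dc_type):
        # The mention string stands in for a fise:TextAnnotation resource.
        node = next(ids)
        triples.extend([
            (node, "rdf:type", rdf_type),
            (node, "fise:inSetting", setting_id),
            (node, "fise:hasMention", mention),
            (node, "dc:type", dc_type),
        ])

    annotate("fise:ParticipantAnnotation", before[-1], "fise:Agent")
    annotate("fise:ParticipantAnnotation", after[0], "fise:Patient")
    annotate("fise:OccurrentAnnotation", tagged[verb_idx][0], "fise:Activity")
    return triples


sentence = [("USA", "NNP"), ("invades", "VBZ"), ("Irak", "NNP")]
triples = extract_setting(sentence)
for triple in triples:
    print(triple)
```

A rule this simple mislabels passive sentences and ignores pronouns that refer to earlier sentences, which is where the dependency-tree and co-reference support discussed in the thread would come in.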
>>> >>> >>
>>> >>> >> best
>>> >>> >> Rupert
>>> >>> >>
>>> >>> >> [1]
>>> >>> >> https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/sentiment-summarization/src/main/java/org/apache/stanbol/enhancer/engines/sentiment/summarize/SentimentSummarizationEngine.java
>>> >>> >> [2]
>>> >>> >> https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/disambiguation-mlt/src/main/java/org/apache/stanbol/enhancer/engine/disambiguation/mlt/DisambiguationData.java
>>> >>> >>
>>> >>> >> >
>>> >>> >> > Thanks
>>> >>> >> >
>>> >>> >> >> Hope this helps to bootstrap this discussion
>>> >>> >> >> best
>>> >>> >> >> Rupert
>>> >>> >> >>
>>> >>> >> >> --
>>> >>> >> >> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>> >>> >> >> | Bodenlehenstraße 11             ++43-699-11108907
>>> >>> >> >> | A-5500 Bischofshofen