Re: Relation extraction feature

Cristian Petroaca Mon, 17 Jun 2013 13:26:18 -0700

Hi Rupert,

Agreed on the SettingAnnotation/ParticipantAnnotation/OccurrentAnnotation
data structure.


Should I open a Jira issue for all of this, to capture this information and
establish the goals and the initial steps towards them?
How should I proceed further? Should I create some design documents that
need to be reviewed?

Regards,
Cristian


2013/6/17 Rupert Westenthaler <[email protected]>

> On Thu, Jun 13, 2013 at 8:22 PM, Cristian Petroaca
> <[email protected]> wrote:
> > Hi Rupert,
> >
> > First of all thanks for the detailed suggestions.
> >
> > 2013/6/12 Rupert Westenthaler <[email protected]>
> >
> >> Hi Cristian, all
> >>
> >> really interesting use case!
> >>
> >> In this mail I will try to give some suggestions on how this could
> >> work out. These suggestions are mainly based on experiences and lessons
> >> learned in the LIVE [2] project where we built an information system
> >> for the Olympic Games in Peking. While this project excluded the
> >> extraction of Events from unstructured text (because the Olympic
> >> Information System was already providing event data as XML messages),
> >> the semantic search capabilities of this system were very similar to
> >> the ones described by your use case.
> >>
> >> IMHO you are not only trying to extract relations, but a formal
> >> representation of the situation described by the text. So let's assume
> >> that the goal is to annotate a Setting (or Situation) described in the
> >> text - a fise:SettingAnnotation.
> >>
> >> The DOLCE foundational ontology [1] gives some advice on how to model
> >> those. The important relation for modeling this is Participation:
> >>
> >>     PC(x, y, t) → (ED(x) ∧ PD(y) ∧ T(t))
> >>
> >> where ..
> >>
> >>  * ED are Endurants (continuants): Endurants do have an identity so we
> >> would typically refer to them as Entities referenced by a setting.
> >> Note that this includes physical, non-physical as well as
> >> social-objects.
> >>  * PD are Perdurants (occurrents): Perdurants are entities that
> >> happen in time. This refers to Events, Activities ...
> >>  * PC is Participation: it is a time-indexed relation where
> >> Endurants participate in Perdurants
> >>
> >> Modeling this in RDF requires defining some intermediate resources
> >> because RDF does not allow for n-ary relations.
> >>
> >>  * fise:SettingAnnotation: It is really handy to define one resource
> >> being the context for all described data. I would call this
> >> "fise:SettingAnnotation" and define it as a sub-concept to
> >> fise:Enhancement. All further enhancement about the extracted Setting
> >> would define a "fise:in-setting" relation to it.
> >>
> >>  * fise:ParticipantAnnotation: Is used to annotate that an Endurant is
> >> participating in a setting (fise:in-setting fise:SettingAnnotation).
> >> The Endurant itself is described by existing fise:TextAnnotation (the
> >> mentions) and fise:EntityAnnotation (suggested Entities). Basically
> >> the fise:ParticipantAnnotation will allow an EnhancementEngine to
> >> state that several mentions (possibly in different sentences)
> >> represent the same Endurant as participating in the Setting. In
> >> addition it would be possible to use the dc:type property (similar to
> >> fise:TextAnnotation) to refer to the role(s) of a participant
> >> (e.g. the set: Agent (intentionally performs an action), Cause
> >> (unintentionally, e.g. a mud slide), Patient (a passive role in an
> >> activity) and Instrument (aids a process)), but I am wondering if one
> >> could extract that information.
> >>
> >>  * fise:OccurrentAnnotation: is used to annotate a Perdurant in the
> >> context of the Setting. Also fise:OccurrentAnnotation can link to
> >> fise:TextAnnotation (typically verbs in the text defining the
> >> perdurant) as well as fise:EntityAnnotation suggesting well known
> >> Events in a knowledge base (e.g. an Election in a country, or an
> >> uprising ...). In addition fise:OccurrentAnnotation can define
> >> dc:has-participant links to fise:ParticipantAnnotation. In this case
> >> it is explicitly stated that an Endurant (the
> >> fise:ParticipantAnnotation) is involved in this Perdurant (the
> >> fise:OccurrentAnnotation). As Occurrences are temporally indexed, this
> >> annotation should also support properties for defining the
> >> xsd:dateTime for the start/end.
> >>
> >>
> > Indeed, an event based data structure makes a lot of sense, with the
> > remark that you probably won't always be able to extract the date for a
> > given setting (situation).
> > There are 2 things which are unclear though.
> >
> > 1. Perdurant: You could have situations in which the object upon which
> > the Subject (or Endurant) is acting is not a transitory object (such as
> > an event or activity) but rather another Endurant. For example we can
> > have the phrase "USA invades Irak" where "USA" is the Endurant (Subject)
> > which performs the action of "invading" on another Endurant, namely
> > "Irak".
> >
>
> By using CAOS, USA would be the Agent and Iraq the Patient. Both are
> Endurants. The activity "invading" would be the Perdurant. So ideally
> you would have a "fise:SettingAnnotation" with:
>
>   * fise:ParticipantAnnotation for USA with the dc:type caos:Agent,
> linking to a fise:TextAnnotation for "USA" and a fise:EntityAnnotation
> linking to dbpedia:United_States
>   * fise:ParticipantAnnotation for Iraq with the dc:type caos:Patient,
> linking to a fise:TextAnnotation for "Irak" and a
> fise:EntityAnnotation linking to dbpedia:Iraq
>   * fise:OccurrentAnnotation for "invades" with the dc:type
> caos:Activity, linking to a fise:TextAnnotation for "invades"
>
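For illustration only, the "USA invades Irak" setting listed above could be written down as a handful of triples. This is a rough sketch, not Stanbol code: plain Python tuples stand in for RDF triples, the urn:example:* identifiers are made up, and the property names (fise:inSetting, fise:hasMention, fise:suggestion) are the ones proposed further down in this thread, not an existing vocabulary.

```python
# Plain tuples stand in for RDF triples; nothing here uses a real RDF library.
# All urn:example:* identifiers are hypothetical.
SETTING = "urn:example:setting-1"

triples = [
    (SETTING, "rdf:type", "fise:SettingAnnotation"),

    ("urn:example:participant-usa", "rdf:type", "fise:ParticipantAnnotation"),
    ("urn:example:participant-usa", "fise:inSetting", SETTING),
    ("urn:example:participant-usa", "dc:type", "caos:Agent"),
    ("urn:example:participant-usa", "fise:hasMention", "urn:example:text-usa"),
    ("urn:example:participant-usa", "fise:suggestion", "dbpedia:United_States"),

    ("urn:example:participant-iraq", "rdf:type", "fise:ParticipantAnnotation"),
    ("urn:example:participant-iraq", "fise:inSetting", SETTING),
    ("urn:example:participant-iraq", "dc:type", "caos:Patient"),
    ("urn:example:participant-iraq", "fise:hasMention", "urn:example:text-irak"),
    ("urn:example:participant-iraq", "fise:suggestion", "dbpedia:Iraq"),

    ("urn:example:occurrent-invades", "rdf:type", "fise:OccurrentAnnotation"),
    ("urn:example:occurrent-invades", "fise:inSetting", SETTING),
    ("urn:example:occurrent-invades", "dc:type", "caos:Activity"),
    ("urn:example:occurrent-invades", "fise:hasMention", "urn:example:text-invades"),
]


def objects(subject, predicate):
    """All objects of the triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def in_setting(setting, rdf_type):
    """All annotations of the given type that point to the setting."""
    return sorted(
        s for s, p, o in triples
        if p == "fise:inSetting" and o == setting
        and rdf_type in objects(s, "rdf:type")
    )
```

Calling in_setting(SETTING, "fise:ParticipantAnnotation") collects both participants, and objects("urn:example:participant-usa", "dc:type") gives the role of the USA participant.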
> > 2. Where does the verb, which links the Subject and the Object, come
> > into this? I imagined that the Endurant would have a dc:"property" where
> > the property = verb which links to the Object in noun form. For example
> > take again the sentence "USA invades Irak". You would have the "USA"
> > Entity with dc:invader which points to the Object "Irak". The Endurant
> > would have as many dc:"property" elements as there are verbs which link
> > it to an Object.
>
> As explained above you would have a fise:OccurrentAnnotation that
> represents the Perdurant. The information that the activity mentioned in
> the text is "invades" would be expressed by linking to a
> fise:TextAnnotation. If you can also provide an Ontology for Tasks that
> defines "myTasks:invade", the fise:OccurrentAnnotation could also link to
> a fise:EntityAnnotation for this concept.
>
> best
> Rupert
>
> >
> >> ### Consuming the data:
> >>
> >> I think this model should be sufficient for use-cases as described by
> >> you.
> >>
> >> Users would be able to consume data on the setting level. This can be
> >> done by simply retrieving all fise:ParticipantAnnotation as well as
> >> fise:OccurrentAnnotation linked with a setting. BTW this was the
> >> approach used in LIVE [2] for semantic search. It allows queries for
> >> Settings that involve specific Entities e.g. you could filter for
> >> Settings that involve a {Person}, activities:Arrested and a specific
> >> {Uprising}. However note that with this approach you will also get
> >> results for Settings where the {Person} participated and another person
> >> was arrested.
> >>
> >> Another possibility would be to process enhancement results on the
> >> fise:OccurrentAnnotation. This would allow a much higher level of
> >> granularity (e.g. it would allow to correctly answer the query
> >> used as an example above). But I am wondering if the quality of the
> >> Setting extraction will be sufficient for this. I also have doubts if
> >> this can still be realized by using semantic indexing to Apache Solr
> >> or if it would be better/necessary to store results in a TripleStore
> >> and use SPARQL for retrieval.
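A small sketch may make the difference between the two query levels concrete. It assumes a toy in-memory model: the class names, the person:A / person:B identifiers and activities:Protested are invented for the example, and neither Solr nor SPARQL is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Occurrent:
    """Toy stand-in for a fise:OccurrentAnnotation with its participants."""
    activity: str
    participants: set = field(default_factory=set)


@dataclass
class Setting:
    """Toy stand-in for a fise:SettingAnnotation."""
    occurrents: list

    def participants(self):
        # Union of the participants of all occurrents in the setting.
        return set().union(*(o.participants for o in self.occurrents))


def matches_setting_level(setting, person, activity):
    """The setting contains the activity and the person, possibly unrelated."""
    has_activity = any(o.activity == activity for o in setting.occurrents)
    return has_activity and person in setting.participants()


def matches_occurrent_level(setting, person, activity):
    """The person participates in an occurrent of exactly this activity."""
    return any(
        o.activity == activity and person in o.participants
        for o in setting.occurrents
    )


# person:B was arrested; person:A only took part in the protest.
setting = Setting([
    Occurrent("activities:Arrested", {"person:B"}),
    Occurrent("activities:Protested", {"person:A", "person:B"}),
])
```

With this data, a setting-level query for person:A and activities:Arrested matches even though only person:B was arrested, while the occurrent-level query rejects it.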
> >>
> >> The methodology and query language used by YAGO [3] is also very
> >> relevant for this (especially note chapter 7 SPOTL(X) Representation).
> >>
> >> Another related Topic is the enrichment of Entities (especially
> >> Events) in knowledge bases based on Settings extracted from Documents.
> >> As per definition - in DOLCE - Perdurants are temporally indexed. That
> >> means that at the time when added to a knowledge base they might still
> >> be in process. So the creation, enriching and refinement of such
> >> Entities in the knowledge base seems to be critical for a System
> >> like the one described in your use-case.
> >>
> >> On Tue, Jun 11, 2013 at 9:09 PM, Cristian Petroaca
> >> <[email protected]> wrote:
> >> >
> >> > First of all I have to mention that I am new in the field of semantic
> >> > technologies, I've started to read about them in the last 4-5
> >> > months. Having said that, I have a high level overview of what is a
> >> > good approach to solve this problem. There are a number of papers on
> >> > the internet which describe what steps need to be taken, such as:
> >> > named entity recognition, co-reference resolution, POS tagging and
> >> > others.
> >>
> >> The Stanbol NLP processing module currently only supports sentence
> >> detection, tokenization, POS tagging, Chunking, NER and lemma. Support
> >> for co-reference resolution and dependency trees is currently missing.
> >>
> >> Stanford NLP is already integrated with Stanbol [4]. At the moment it
> >> only supports English, but I am already working on including the other
> >> supported languages. Other NLP frameworks that are already integrated
> >> with Stanbol are Freeling [5] and Talismane [6]. But note that for all
> >> those the integration excludes support for co-reference and dependency
> >> trees.
> >>
> >> Anyways I am confident that one can implement a first prototype by
> >> only using Sentences and POS tags and - if available - Chunks (e.g.
> >> Noun phrases).
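As a very rough idea of what a POS-only prototype could look like, the sketch below pairs each verb with the nearest noun before it and the nearest noun after it within one sentence. It assumes Penn-Treebank-style tags (NN*, VB*) on (word, tag) pairs; it does not use the Stanbol NLP API, and the function name is made up for the example.

```python
def extract_candidates(tagged_sentence):
    """Return naive (subject, verb, object) candidates for one sentence.

    tagged_sentence is a list of (word, pos_tag) pairs. For every verb the
    closest noun to its left and the closest noun to its right are taken.
    This is a heuristic only: no chunks, no co-reference, no dependencies.
    """
    candidates = []
    for i, (word, tag) in enumerate(tagged_sentence):
        if not tag.startswith("VB"):
            continue
        nouns_before = [w for w, t in tagged_sentence[:i] if t.startswith("NN")]
        nouns_after = [w for w, t in tagged_sentence[i + 1:] if t.startswith("NN")]
        if nouns_before and nouns_after:
            candidates.append((nouns_before[-1], word, nouns_after[0]))
    return candidates


sentence = [("USA", "NNP"), ("invades", "VBZ"), ("Irak", "NNP")]
candidates = extract_candidates(sentence)
```

For the example sentence this produces the single candidate ("USA", "invades", "Irak"). Sentences with pronouns or passive voice would need more than this heuristic.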
> >>
> >>
> > I assume that in the Stanbol context, a feature like Relation extraction
> > would be implemented as an EnhancementEngine?
> > What kind of effort would be required for a co-reference resolution tool
> > integration into Stanbol?
> >
>
> Yes in the end it would be an EnhancementEngine. But before we can
> build such an engine we would need to
>
> * extend the Stanbol NLP processing API with Annotations for co-reference
> * add support for JSON Serialisation/Parsing for those annotation so
> that the RESTful NLP Analysis Service can provide co-reference
> information
>
> > At this moment I'll be focusing on 2 aspects:
> >
> > 1. Determine the best data structure to encapsulate the extracted
> > information. I'll take a closer look at Dolce.
>
> Don't make it too complex. Defining a proper structure to represent
> Events will only pay off if we can also successfully extract such
> information from processed texts.
>
> I would start with
>
>  * fise:SettingAnnotation
>     * {fise:Enhancement} metadata
>
>  * fise:ParticipantAnnotation
>     * {fise:Enhancement} metadata
>     * fise:inSetting {settingAnnotation}
>     * fise:hasMention {textAnnotation}
>     * fise:suggestion {entityAnnotation} (multiple if there are more
> suggestions)
>     * dc:type one of fise:Agent, fise:Patient, fise:Instrument, fise:Cause
>
>  * fise:OccurrentAnnotation
>     * {fise:Enhancement} metadata
>     * fise:inSetting {settingAnnotation}
>     * fise:hasMention {textAnnotation}
>     * dc:type set to fise:Activity
>
> If it turns out that we can extract more, we can add more structure to
> those annotations. We might also think about using an own namespace
> for those extensions to the annotation structure.
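The minimal structure above could be mirrored in memory roughly as follows. This is only a sketch under assumptions: the class and field names follow the proposed property names, the {fise:Enhancement} metadata is left out, and plain strings stand in for references to text and entity annotations.

```python
from dataclasses import dataclass, field
from typing import List

# Roles proposed for dc:type on a participant.
ROLES = {"fise:Agent", "fise:Patient", "fise:Instrument", "fise:Cause"}


@dataclass
class SettingAnnotation:
    uri: str  # hypothetical identifier of the setting


@dataclass
class ParticipantAnnotation:
    in_setting: SettingAnnotation   # fise:inSetting
    has_mention: str                # fise:hasMention (text annotation id)
    role: str                       # dc:type, one of ROLES
    suggestions: List[str] = field(default_factory=list)  # fise:suggestion

    def __post_init__(self):
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")


@dataclass
class OccurrentAnnotation:
    in_setting: SettingAnnotation   # fise:inSetting
    has_mention: str                # fise:hasMention (text annotation id)
    dc_type: str = "fise:Activity"  # dc:type


setting = SettingAnnotation("urn:example:setting-1")
usa = ParticipantAnnotation(setting, "urn:example:text-usa", "fise:Agent",
                            ["dbpedia:United_States"])
invades = OccurrentAnnotation(setting, "urn:example:text-invades")
```

Validating the role in __post_init__ keeps the set of roles closed for now; it would have to be relaxed if more structure is added later.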
>
> > 2. Determine how all of this should be integrated into Stanbol.
>
> Just create an EventExtractionEngine and configure an enhancement chain
> that does NLP processing and EntityLinking.
>
> You should have a look at
>
> * SentimentSummarizationEngine [1] as it does a lot of things with NLP
> processing results (e.g. connecting adjectives (via verbs) to
> nouns/pronouns). So as long as we cannot use explicit dependency trees
> your code will need to do similar things with Nouns, Pronouns and
> Verbs.
>
> * Disambiguation-MLT engine, as it creates a Java representation of
> present fise:TextAnnotation and fise:EntityAnnotation [2]. Something
> similar will also be required by the EventExtractionEngine for fast
> access to such annotations while iterating over the Sentences of the
> text.
>
>
> best
> Rupert
>
> [1]
> https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/sentiment-summarization/src/main/java/org/apache/stanbol/enhancer/engines/sentiment/summarize/SentimentSummarizationEngine.java
> [2]
> https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/disambiguation-mlt/src/main/java/org/apache/stanbol/enhancer/engine/disambiguation/mlt/DisambiguationData.java
>
> >
> > Thanks
> >
> >> Hope this helps to bootstrap this discussion
> >> best
> >> Rupert
> >>
> >> --
> >> | Rupert Westenthaler             [email protected]
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
