Re: Relation extraction feature

Rupert Westenthaler Mon, 17 Jun 2013 02:51:20 -0700

On Thu, Jun 13, 2013 at 8:22 PM, Cristian Petroaca
<[email protected]> wrote:
> Hi Rupert,
>
> First of all thanks for the detailed suggestions.
>
> 2013/6/12 Rupert Westenthaler <[email protected]>
>
>> Hi Cristian, all
>>
>> really interesting use case!
>>
>> In this mail I will try to give some suggestions on how this could
>> work out. These suggestions are mainly based on experiences and lessons
>> learned in the LIVE [2] project, where we built an information system
>> for the Olympic Games in Peking. While this project excluded the
>> extraction of Events from unstructured text (because the Olympic
>> Information System was already providing event data as XML messages),
>> the semantic search capabilities of this system were very similar to
>> the ones described by your use case.
>>
>> IMHO you are not only trying to extract relations, but a formal
>> representation of the situation described by the text. So let's assume
>> that the goal is to annotate a Setting (or Situation) described in the
>> text - a fise:SettingAnnotation.
>>
>> The DOLCE foundational ontology [1] gives some advice on how to model
>> those. The important relation for modeling this is Participation:
>>
>>     PC(x, y, t) → (ED(x) ∧ PD(y) ∧ T(t))
>>
>> where ..
>>
>>  * ED are Endurants (continuants): Endurants have an identity, so we
>> would typically refer to them as Entities referenced by a setting.
>> Note that this includes physical, non-physical as well as
>> social objects.
>>  * PD are Perdurants (occurrents): Perdurants are entities that
>> happen in time. This refers to Events, Activities ...
>>  * PC is Participation: It is a time-indexed relation where
>> Endurants participate in Perdurants
>>
>> Modeling this in RDF requires defining some intermediate resources
>> because RDF does not allow for n-ary relations.
>>
>>  * fise:SettingAnnotation: It is really handy to define one resource
>> being the context for all described data. I would call this
>> "fise:SettingAnnotation" and define it as a sub-concept of
>> fise:Enhancement. All further enhancements about the extracted Setting
>> would define a "fise:in-setting" relation to it.
>>
>>  * fise:ParticipantAnnotation: Is used to annotate that an Endurant is
>> participating in a setting (fise:in-setting fise:SettingAnnotation).
>> The Endurant itself is described by existing fise:TextAnnotation (the
>> mentions) and fise:EntityAnnotation (suggested Entities). Basically
>> the fise:ParticipantAnnotation will allow an EnhancementEngine to
>> state that several mentions (possibly in different sentences)
>> represent the same Endurant as participating in the Setting. In
>> addition it would be possible to use the dc:type property (as
>> for fise:TextAnnotation) to refer to the role(s) of a participant
>> (e.g. the set: Agent (intentionally performs an action), Cause
>> (unintentionally, e.g. a mud slide), Patient (a passive role in an
>> activity) and Instrument (aids a process)), but I am wondering if one
>> could extract that information.
>>
>> * fise:OccurrentAnnotation: is used to annotate a Perdurant in the
>> context of the Setting. Also fise:OccurrentAnnotation can link to
>> fise:TextAnnotation (typically verbs in the text defining the
>> perdurant) as well as fise:EntityAnnotation suggesting well known
>> Events in a knowledge base (e.g. an Election in a country, or an
>> uprising ...). In addition fise:OccurrentAnnotation can define
>> dc:has-participant links to fise:ParticipantAnnotation. In this case
>> it is explicitly stated that an Endurant (the
>> fise:ParticipantAnnotation) is involved in this Perdurant (the
>> fise:OccurrentAnnotation). As Occurrences are temporally indexed, this
>> annotation should also support properties for defining the
>> xsd:dateTime for the start/end.
>>
>>
> Indeed, an event-based data structure makes a lot of sense, with the remark
> that you probably won't be able to always extract the date for a given
> setting (situation).
> There are 2 things which are unclear though.
>
> 1. Perdurant : You could have situations in which the object upon which the
> Subject ( or Endurant ) is acting is not a transitory object ( such as an
> event, activity ) but rather another Endurant. For example we can have the
> phrase "USA invades Irak" where "USA" is the Endurant ( Subject ) which
> performs the action of "invading" on another Endurant, namely "Irak".
>


By using CAOS, USA would be the Agent and Iraq the Patient. Both are
Endurants. The activity "invading" would be the Perdurant. So ideally
you would have a "fise:SettingAnnotation" with:

  * fise:ParticipantAnnotation for USA with the dc:type caos:Agent,
linking to a fise:TextAnnotation for "USA" and a fise:EntityAnnotation
linking to dbpedia:United_States
  * fise:ParticipantAnnotation for Iraq with the dc:type caos:Patient,
linking to a fise:TextAnnotation for "Irak" and a
fise:EntityAnnotation linking to dbpedia:Iraq
  * fise:OccurrentAnnotation for "invades" with the dc:type
caos:Activity, linking to a fise:TextAnnotation for "invades"
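
The three annotations above can be written down as plain triples. This
is only a minimal sketch, not an existing Stanbol API: the "urn:example:"
URIs are made up for illustration, and the property names fise:inSetting,
fise:hasMention and fise:suggestion are the ones proposed further down in
this mail, so they are assumptions rather than an established vocabulary.

```python
def build_setting_triples():
    """Return the "USA invades Irak" setting as (subject, predicate, object) triples."""
    setting = "urn:example:setting-1"  # hypothetical URI
    triples = {(setting, "rdf:type", "fise:SettingAnnotation")}

    # (local name, role, surface form in the text, suggested entity)
    participants = [
        ("usa", "caos:Agent", "USA", "dbpedia:United_States"),
        ("iraq", "caos:Patient", "Irak", "dbpedia:Iraq"),
    ]
    for name, role, mention, entity in participants:
        participant = f"urn:example:participant-{name}"
        text_anno = f"urn:example:text-{name}"
        entity_anno = f"urn:example:entity-{name}"
        triples |= {
            (participant, "rdf:type", "fise:ParticipantAnnotation"),
            (participant, "fise:inSetting", setting),      # proposed property
            (participant, "dc:type", role),
            (participant, "fise:hasMention", text_anno),   # proposed property
            (participant, "fise:suggestion", entity_anno),  # proposed property
            (text_anno, "rdf:type", "fise:TextAnnotation"),
            (text_anno, "fise:selected-text", mention),
            (entity_anno, "rdf:type", "fise:EntityAnnotation"),
            (entity_anno, "fise:entity-reference", entity),
        }

    # The Perdurant: the activity "invades" mentioned in the text.
    occurrent = "urn:example:occurrent-invades"
    verb_anno = "urn:example:text-invades"
    triples |= {
        (occurrent, "rdf:type", "fise:OccurrentAnnotation"),
        (occurrent, "fise:inSetting", setting),
        (occurrent, "dc:type", "caos:Activity"),
        (occurrent, "fise:hasMention", verb_anno),
        (verb_anno, "rdf:type", "fise:TextAnnotation"),
        (verb_anno, "fise:selected-text", "invades"),
    }
    return triples


def in_setting(triples, setting):
    """Sorted subjects that declare fise:inSetting for the given setting."""
    return sorted(s for s, p, o in triples if p == "fise:inSetting" and o == setting)
```

Calling in_setting(build_setting_triples(), "urn:example:setting-1")
returns the two participants and the occurrent, which is the
setting-level view of the extracted data.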

> 2. Where does the verb, which links the Subject and the Object come into
> this? I imagined that the Endurant would have a dc:"property" where the
> property = verb which links to the Object in noun form. For example take
> again the sentence "USA invades Irak". You would have the "USA" Entity with
> dc:invader which points to the Object "Irak". The Endurant would have as
> many dc:"property" elements as there are verbs which link it to an Object.

As explained above you would have a fise:OccurrentAnnotation that
represents the Perdurant. The information that the activity mentioned in
the text is "invades" would be expressed by linking to a
fise:TextAnnotation. If you can also provide an Ontology for Tasks that
defines "myTasks:invade", the fise:OccurrentAnnotation could also link to
a fise:EntityAnnotation for this concept.

best
Rupert

>
>> ### Consuming the data:
>>
>> I think this model should be sufficient for use-cases as described by you.
>>
>> Users would be able to consume data on the setting level. This can be
>> done by simply retrieving all fise:ParticipantAnnotation as well as
>> fise:OccurrentAnnotation linked with a setting. BTW this was the
>> approach used in LIVE [2] for semantic search. It allows queries for
>> Settings that involve specific Entities e.g. you could filter for
>> Settings that involve a {Person}, activities:Arrested and a specific
>> {Uprising}. However note that with this approach you will also get
>> results for Settings where the {Person} participated and another person
>> was arrested.
>>
>> Another possibility would be to process enhancement results on the
>> fise:OccurrentAnnotation. This would allow a much higher level of
>> granularity (e.g. it would allow to correctly answer the query
>> used as an example above). But I am wondering if the quality of the
>> Setting extraction will be sufficient for this. I also have doubts
>> whether this can still be realized by using semantic indexing to Apache
>> Solr or if it would be better/necessary to store results in a
>> TripleStore and use SPARQL for retrieval.
>>
>> The methodology and query language used by YAGO [3] is also very
>> relevant for this (especially note chapter 7 SPOTL(X) Representation).
>>
>> Another related topic is the enrichment of Entities (especially
>> Events) in knowledge bases based on Settings extracted from Documents.
>> As per definition - in DOLCE - Perdurants are temporally indexed. That
>> means that at the time when added to a knowledge base they might still
>> be in process. So the creation, enriching and refinement of such
>> Entities in the knowledge base seems to be critical for a system
>> like the one described in your use-case.
>>
>> On Tue, Jun 11, 2013 at 9:09 PM, Cristian Petroaca
>> <[email protected]> wrote:
>> >
>> > First of all I have to mention that I am new in the field of semantic
>> > technologies, I've started to read about them in the last 4-5 months.
>> > Having said that I have a high level overview of what is a good approach
>> > to solve this problem. There are a number of papers on the internet which
>> > describe what steps need to be taken such as: named entity recognition,
>> > co-reference resolution, pos tagging and others.
>>
>> The Stanbol NLP processing module currently only supports sentence
>> detection, tokenization, POS tagging, Chunking, NER and lemma. Support
>> for co-reference resolution and dependency trees is currently missing.
>>
>> Stanford NLP is already integrated with Stanbol [4]. At the moment it
>> only supports English, but I am already working on including the other
>> supported languages. Other NLP frameworks that are already integrated
>> with Stanbol are Freeling [5] and Talismane [6]. But note that for all
>> those the integration excludes support for co-reference and dependency
>> trees.
>>
>> Anyways I am confident that one can implement a first prototype by
>> only using Sentences and POS tags and - if available - Chunks (e.g.
>> Noun phrases).
>>
>>
> I assume that in the Stanbol context, a feature like Relation extraction
> would be implemented as an EnhancementEngine?
> What kind of effort would be required for a co-reference resolution tool
> integration into Stanbol?
>

Yes, in the end it would be an EnhancementEngine. But before we can
build such an engine we would need to

* extend the Stanbol NLP processing API with Annotations for co-reference
* add support for JSON Serialisation/Parsing for those annotations so
that the RESTful NLP Analysis Service can provide co-reference
information
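
To make the second point more concrete, here is a rough sketch of what a
JSON round trip for co-reference information could look like. The field
names ("spans", "coref", "representative", "mentions") are purely
hypothetical; this is not the format of the Stanbol NLP Analysis Service,
just an illustration of the kind of data that would need to be
serialised and parsed.

```python
import json


def serialize_coref(chains):
    """Serialise co-reference chains to JSON.

    chains: list of chains; each chain is a list of (start, end) character
    offsets that refer to the same entity. The first offset pair of a chain
    is treated as the representative mention (an assumption of this sketch).
    """
    spans = []
    for chain in chains:
        for start, end in chain:
            others = [{"start": s, "end": e} for s, e in chain if (s, e) != (start, end)]
            spans.append({
                "start": start,
                "end": end,
                "coref": {
                    "representative": (start, end) == chain[0],
                    "mentions": others,
                },
            })
    spans.sort(key=lambda span: (span["start"], span["end"]))
    return json.dumps({"spans": spans})


def parse_coref(payload):
    """Rebuild the chains from the JSON produced by serialize_coref."""
    chains = []
    for span in json.loads(payload)["spans"]:
        coref = span["coref"]
        if coref["representative"]:
            chain = [(span["start"], span["end"])]
            chain += [(m["start"], m["end"]) for m in coref["mentions"]]
            chains.append(chain)
    return chains
```

Each span carries references to the other mentions of its chain, so a
client that only looks at a single span can still find its co-referring
mentions without scanning the whole document.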

> At this moment I'll be focusing on 2 aspects:
>
> 1. Determine the best data structure to encapsulate the extracted
> information. I'll take a closer look at Dolce.

Don't make it too complex. Defining a proper structure to represent
Events will only pay off if we can also successfully extract such
information from processed texts.

I would start with

 * fise:SettingAnnotation
    * {fise:Enhancement} metadata

 * fise:ParticipantAnnotation
    * {fise:Enhancement} metadata
    * fise:inSetting {settingAnnotation}
    * fise:hasMention {textAnnotation}
    * fise:suggestion {entityAnnotation} (multiple if there are more
suggestions)
    * dc:type one of fise:Agent, fise:Patient, fise:Instrument, fise:Cause

 * fise:OccurrentAnnotation
    * {fise:Enhancement} metadata
    * fise:inSetting {settingAnnotation}
    * fise:hasMention {textAnnotation}
    * dc:type set to fise:Activity

If it turns out that we can extract more, we can add more structure to
those annotations. We might also think about using a separate namespace
for those extensions to the annotation structure.
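
For illustration, the initial structure listed above could be mirrored by
a few small classes. This is a sketch under assumptions: the class and
field names simply follow the proposed property names, mentions and
suggestions are held as URI strings, and nothing here is an existing
Stanbol type.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Role(Enum):
    """The participant roles proposed for dc:type."""
    AGENT = "fise:Agent"
    PATIENT = "fise:Patient"
    INSTRUMENT = "fise:Instrument"
    CAUSE = "fise:Cause"


@dataclass(frozen=True)
class SettingAnnotation:
    uri: str


@dataclass
class ParticipantAnnotation:
    uri: str
    in_setting: SettingAnnotation           # fise:inSetting
    has_mention: List[str]                  # fise:hasMention (TextAnnotation URIs)
    suggestions: List[str] = field(default_factory=list)  # fise:suggestion
    role: Optional[Role] = None             # dc:type, if a role could be extracted


@dataclass
class OccurrentAnnotation:
    uri: str
    in_setting: SettingAnnotation           # fise:inSetting
    has_mention: List[str]                  # fise:hasMention (typically a verb)
    dc_type: str = "fise:Activity"


def annotations_in(setting, annotations):
    """All participant and occurrent annotations that belong to a setting."""
    return [a for a in annotations if a.in_setting == setting]
```

Keeping the role optional reflects the open question of whether roles can
be extracted reliably at all; an engine could leave it unset at first.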

> 2. Determine how should all of this be integrated into Stanbol.

Just create an EventExtractionEngine and configure an enhancement chain
that does NLP processing and EntityLinking.

You should have a look at

* SentimentSummarizationEngine [1], as it does a lot of things with NLP
processing results (e.g. connecting adjectives (via verbs) to
nouns/pronouns). As long as we cannot use explicit dependency trees,
your code will need to do similar things with Nouns, Pronouns and
Verbs.

* Disambiguation-MLT engine, as it creates a Java representation of
present fise:TextAnnotation and fise:EntityAnnotation [2]. Something
similar will also be required by the EventExtractionEngine for fast
access to such annotations while iterating over the Sentences of the
text.
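
To give an idea of what "similar things with Nouns, Pronouns and Verbs"
could mean without dependency trees, here is a deliberately naive sketch.
It is not how the SentimentSummarizationEngine works; it assumes each
sentence is available as (word, tag) pairs with coarse tags such as
"VERB", "NOUN", "PROPN" and "PRON" (these tag names are an assumption,
not Stanbol's own categories), and it only handles simple active-voice
sentences.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, coarse POS tag); tag names are assumptions

NOUN_TAGS = {"NOUN", "PROPN", "PRON"}


def extract_participation(sentence: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return (agent, verb, patient) for the first verb that has a noun on
    both sides, or None if the sentence contains no such verb.

    Heuristic: the closest noun before the verb is taken as the Agent and
    the closest noun after it as the Patient. Passive voice, negation and
    longer noun phrases are ignored by this sketch.
    """
    for i, (word, tag) in enumerate(sentence):
        if tag != "VERB":
            continue
        agent = next((w for w, t in reversed(sentence[:i]) if t in NOUN_TAGS), None)
        patient = next((w for w, t in sentence[i + 1:] if t in NOUN_TAGS), None)
        if agent is not None and patient is not None:
            return agent, word, patient
    return None
```

For the tokens of "USA invades Irak" tagged as PROPN, VERB, PROPN this
yields ("USA", "invades", "Irak"), which would then be mapped to two
fise:ParticipantAnnotation and one fise:OccurrentAnnotation.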


best
Rupert

[1] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/sentiment-summarization/src/main/java/org/apache/stanbol/enhancer/engines/sentiment/summarize/SentimentSummarizationEngine.java
[2] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/disambiguation-mlt/src/main/java/org/apache/stanbol/enhancer/engine/disambiguation/mlt/DisambiguationData.java

>
> Thanks
>
>> Hope this helps to bootstrap this discussion
>> best
>> Rupert
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
