Re: Relation extraction feature

Rupert Westenthaler Thu, 04 Jul 2013 21:58:50 -0700

Hi Cristian,

I created the branch at


    http://svn.apache.org/repos/asf/stanbol/branches/nlp-dep-tree-and-co-ref/

ATM it contains only the "nlp" and "nlp-json" modules. Let me know if
you would like to have more.

best
Rupert



On Thu, Jul 4, 2013 at 10:14 AM, Cristian Petroaca
<[email protected]> wrote:
> Hi Rupert,
>
> I created two JIRA issues: https://issues.apache.org/jira/browse/STANBOL-1132 and
> https://issues.apache.org/jira/browse/STANBOL-1133. The original one is
> dependent upon these.
> Please let me know when I can start using the branch.
>
> Thanks,
> Cristian
>
>
> 2013/6/27 Cristian Petroaca <[email protected]>
>
>>
>>
>>
>> 2013/6/27 Rupert Westenthaler <[email protected]>
>>
>>> On Thu, Jun 27, 2013 at 3:12 PM, Cristian Petroaca
>>> <[email protected]> wrote:
>>> > Sorry, I meant the Stanbol NLP API, not Stanford in my previous e-mail.
>>> > By the way, does Open NLP have the ability to build dependency trees?
>>> >
>>>
>>> AFAIK OpenNLP does not provide this feature.
>>>
>>
>> Then, since the Stanford NLP lib is also integrated into Stanbol, I'll
>> take a look at how I can extend its integration to include the dependency
>> tree feature.
>>
>>>
>>>
>>  >
>>> > 2013/6/23 Cristian Petroaca <[email protected]>
>>> >
>>> >> Hi Rupert,
>>> >>
>>> >> I created jira https://issues.apache.org/jira/browse/STANBOL-1121.
>>> >> As you suggested I would start with extending the Stanford NLP with
>>> >> co-reference resolution but I think also with dependency trees because I
>>> >> also need to know the Subject of the sentence and the object that it
>>> >> affects, right?
>>> >>
>>> >> Given that I need to extend the Stanford NLP API in Stanbol for
>>> >> co-reference and dependency trees, how do I proceed with this? Do I create
>>> >> 2 new sub-tasks to the already opened Jira? After that can I start
>>> >> implementing on my local copy of Stanbol and when I'm done I'll send you
>>> >> guys the patch for review?
>>> >>
>>>
>>> I would create two "New Feature" type Issues one for adding support
>>> for "dependency trees" and the other for "co-reference" support. You
>>> should also define "depends on" relations between STANBOL-1121 and
>>> those two new issues.
>>>
>>> Sub-task could also work, but as adding those features would be also
>>> interesting for other things I would rather define them as separate
>>> issues.
>>>
>>>
>> 2 New Features connected with the original jira it is then.
>>
>>
>>> If you would prefer to work in your own branch please tell me. This
>>> could have the advantage that patches would not be affected by changes
>>> in the trunk.
>>>
>> Yes, a separate branch sounds good.
>>
>>> best
>>> Rupert
>>>
>>> >> Regards,
>>> >> Cristian
>>> >>
>>> >>
>>> >> 2013/6/18 Rupert Westenthaler <[email protected]>
>>> >>
>>> >>> On Mon, Jun 17, 2013 at 10:18 PM, Cristian Petroaca
>>> >>> <[email protected]> wrote:
>>> >>> > Hi Rupert,
>>> >>> >
>>> >>> > Agreed on the SettingAnnotation/ParticipantAnnotation/OccurrentAnnotation
>>> >>> > data structure.
>>> >>> >
>>> >>> > Should I open up a Jira for all of this in order to encapsulate this
>>> >>> > information and establish the goals and these initial steps towards
>>> >>> > these goals?
>>> >>>
>>> >>> Yes please. A JIRA issue for this work would be great.
>>> >>>
>>> >>> > How should I proceed further? Should I create some design documents
>>> >>> > that need to be reviewed?
>>> >>>
>>> >>> Usually it is best to write design related text directly in JIRA
>>> >>> by using Markdown [1] syntax. This will allow us later to use this
>>> >>> text directly for the documentation on the Stanbol Webpage.
>>> >>>
>>> >>> best
>>> >>> Rupert
>>> >>>
>>> >>>
>>> >>> [1] http://daringfireball.net/projects/markdown/
>>> >>> >
>>> >>> > Regards,
>>> >>> > Cristian
>>> >>> >
>>> >>> >
>>> >>> > 2013/6/17 Rupert Westenthaler <[email protected]>
>>> >>> >
>>> >>> >> On Thu, Jun 13, 2013 at 8:22 PM, Cristian Petroaca
>>> >>> >> <[email protected]> wrote:
>>> >>> >> > Hi Rupert,
>>> >>> >> >
>>> >>> >> > First of all thanks for the detailed suggestions.
>>> >>> >> >
>>> >>> >> > 2013/6/12 Rupert Westenthaler <[email protected]>
>>> >>> >> >
>>> >>> >> >> Hi Cristian, all
>>> >>> >> >>
>>> >>> >> >> really interesting use case!
>>> >>> >> >>
>>> >>> >> >> In this mail I will try to give some suggestions on how this could
>>> >>> >> >> work out. These suggestions are mainly based on experiences and
>>> >>> >> >> lessons learned in the LIVE [2] project where we built an
>>> >>> >> >> information system for the Olympic Games in Peking. While this
>>> >>> >> >> project excluded the extraction of Events from unstructured text
>>> >>> >> >> (because the Olympic Information System was already providing
>>> >>> >> >> event data as XML messages), the semantic search capabilities of
>>> >>> >> >> this system were very similar to the ones described by your use
>>> >>> >> >> case.
>>> >>> >> >>
>>> >>> >> >> IMHO you are not only trying to extract relations, but a formal
>>> >>> >> >> representation of the situation described by the text. So let's
>>> >>> >> >> assume that the goal is to annotate a Setting (or Situation)
>>> >>> >> >> described in the text - a fise:SettingAnnotation.
>>> >>> >> >>
>>> >>> >> >> The DOLCE foundational ontology [1] gives some advice on how to
>>> >>> >> >> model those. The important relation for modeling this is
>>> >>> >> >> Participation:
>>> >>> >> >>
>>> >>> >> >>     PC(x, y, t) → (ED(x) ∧ PD(y) ∧ T(t))
>>> >>> >> >>
>>> >>> >> >> where ..
>>> >>> >> >>
>>> >>> >> >>  * ED are Endurants (continuants): Endurants do have an identity,
>>> >>> >> >> so we would typically refer to them as Entities referenced by a
>>> >>> >> >> setting. Note that this includes physical, non-physical as well as
>>> >>> >> >> social-objects.
>>> >>> >> >>  * PD are Perdurants (occurrents): Perdurants are entities that
>>> >>> >> >> happen in time. This refers to Events, Activities ...
>>> >>> >> >>  * PC is Participation: a time-indexed relation where
>>> >>> >> >> Endurants participate in Perdurants
>>> >>> >> >>
>>> >>> >> >> Modeling this in RDF requires defining some intermediate resources
>>> >>> >> >> because RDF does not allow for n-ary relations.
>>> >>> >> >>
>>> >>> >> >>  * fise:SettingAnnotation: It is really handy to define one
>>> >>> >> >> resource being the context for all described data. I would call
>>> >>> >> >> this "fise:SettingAnnotation" and define it as a sub-concept of
>>> >>> >> >> fise:Enhancement. All further enhancements about the extracted
>>> >>> >> >> Setting would define a "fise:in-setting" relation to it.
>>> >>> >> >>
>>> >>> >> >>  * fise:ParticipantAnnotation: Is used to annotate that an Endurant
>>> >>> >> >> is participating in a setting (fise:in-setting
>>> >>> >> >> fise:SettingAnnotation). The Endurant itself is described by
>>> >>> >> >> existing fise:TextAnnotation (the mentions) and
>>> >>> >> >> fise:EntityAnnotation (suggested Entities). Basically the
>>> >>> >> >> fise:ParticipantAnnotation will allow an EnhancementEngine to
>>> >>> >> >> state that several mentions (in possibly different sentences)
>>> >>> >> >> represent the same Endurant as participating in the Setting. In
>>> >>> >> >> addition it would be possible to use the dc:type property (similar
>>> >>> >> >> to fise:TextAnnotation) to refer to the role(s) of a participant
>>> >>> >> >> (e.g. the set: Agent (intentionally performs an action), Cause
>>> >>> >> >> (unintentionally, e.g. a mud slide), Patient (a passive role in an
>>> >>> >> >> activity) and Instrument (aids a process)), but I am wondering if
>>> >>> >> >> one could extract that information.
>>> >>> >> >>
>>> >>> >> >>  * fise:OccurrentAnnotation: is used to annotate a Perdurant in the
>>> >>> >> >> context of the Setting. Also fise:OccurrentAnnotation can link to
>>> >>> >> >> fise:TextAnnotation (typically verbs in the text defining the
>>> >>> >> >> perdurant) as well as fise:EntityAnnotation suggesting well known
>>> >>> >> >> Events in a knowledge base (e.g. an Election in a country, or an
>>> >>> >> >> uprising ...). In addition fise:OccurrentAnnotation can define
>>> >>> >> >> dc:has-participant links to fise:ParticipantAnnotation. In this
>>> >>> >> >> case it is explicitly stated that an Endurant (the
>>> >>> >> >> fise:ParticipantAnnotation) is involved in this Perdurant (the
>>> >>> >> >> fise:OccurrentAnnotation). As Occurrences are temporally indexed,
>>> >>> >> >> this annotation should also support properties for defining the
>>> >>> >> >> xsd:dateTime for the start/end.
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> > Indeed, an event based data structure makes a lot of sense, with the
>>> >>> >> > remark that you probably won't be able to always extract the date
>>> >>> >> > for a given setting (situation).
>>> >>> >> > There are 2 things which are unclear though.
>>> >>> >> >
>>> >>> >> > 1. Perdurant: You could have situations in which the object upon
>>> >>> >> > which the Subject (or Endurant) is acting is not a transitory object
>>> >>> >> > (such as an event, activity) but rather another Endurant. For
>>> >>> >> > example we can have the phrase "USA invades Irak" where "USA" is the
>>> >>> >> > Endurant (Subject) which performs the action of "invading" on
>>> >>> >> > another Endurant, namely "Irak".
>>> >>> >> >
>>> >>> >>
>>> >>> >> By using CAOS, USA would be the Agent and Iraq the Patient. Both are
>>> >>> >> Endurants. The activity "invading" would be the Perdurant. So ideally
>>> >>> >> you would have a "fise:SettingAnnotation" with:
>>> >>> >>
>>> >>> >>   * fise:ParticipantAnnotation for USA with the dc:type caos:Agent,
>>> >>> >> linking to a fise:TextAnnotation for "USA" and a
>>> >>> >> fise:EntityAnnotation linking to dbpedia:United_States
>>> >>> >>   * fise:ParticipantAnnotation for Iraq with the dc:type
>>> >>> >> caos:Patient, linking to a fise:TextAnnotation for "Irak" and a
>>> >>> >> fise:EntityAnnotation linking to dbpedia:Iraq
>>> >>> >>   * fise:OccurrentAnnotation for "invades" with the dc:type
>>> >>> >> caos:Activity, linking to a fise:TextAnnotation for "invades"
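
A minimal sketch of the annotation graph described above, using plain Python tuples in place of RDF triples. The urn:example: identifiers and the two helper functions are hypothetical. The property names fise:inSetting, fise:hasMention and fise:suggestion are the ones proposed elsewhere in this thread, and using fise:entity-reference to point at the DBpedia resource is an assumption of this sketch.

```python
# Plain (subject, predicate, object) tuples stand in for RDF triples.
# All urn:example:* identifiers are hypothetical placeholders.
S = "urn:example:setting"
USA = "urn:example:participant-usa"
IRAQ = "urn:example:participant-iraq"
INVADES = "urn:example:occurrent-invades"

TRIPLES = [
    (S, "rdf:type", "fise:SettingAnnotation"),
    # Participant "USA" in the role of Agent
    (USA, "rdf:type", "fise:ParticipantAnnotation"),
    (USA, "fise:inSetting", S),
    (USA, "dc:type", "caos:Agent"),
    (USA, "fise:hasMention", "urn:example:text-usa"),
    (USA, "fise:suggestion", "urn:example:entity-usa"),
    ("urn:example:entity-usa", "fise:entity-reference", "dbpedia:United_States"),
    # Participant "Irak" in the role of Patient
    (IRAQ, "rdf:type", "fise:ParticipantAnnotation"),
    (IRAQ, "fise:inSetting", S),
    (IRAQ, "dc:type", "caos:Patient"),
    (IRAQ, "fise:hasMention", "urn:example:text-irak"),
    (IRAQ, "fise:suggestion", "urn:example:entity-iraq"),
    ("urn:example:entity-iraq", "fise:entity-reference", "dbpedia:Iraq"),
    # The activity "invades" as the Perdurant
    (INVADES, "rdf:type", "fise:OccurrentAnnotation"),
    (INVADES, "fise:inSetting", S),
    (INVADES, "dc:type", "caos:Activity"),
    (INVADES, "fise:hasMention", "urn:example:text-invades"),
]


def objects(subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def roles_in(setting):
    """Map each dc:type used in a setting to the annotation carrying it."""
    members = [s for s, p, o in TRIPLES if p == "fise:inSetting" and o == setting]
    return {objects(m, "dc:type")[0]: m for m in members}


print(roles_in(S))
```

The setting resource is the only thing the three annotations share, which is what makes it possible to retrieve the whole situation with a single lookup on fise:inSetting.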
>>> >>> >>
>>> >>> >> > 2. Where does the verb, which links the Subject and the Object, come
>>> >>> >> > into this? I imagined that the Endurant would have a dc:"property"
>>> >>> >> > where the property = verb which links to the Object in noun form.
>>> >>> >> > For example take again the sentence "USA invades Irak". You would
>>> >>> >> > have the "USA" Entity with dc:invader which points to the Object
>>> >>> >> > "Irak". The Endurant would have as many dc:"property" elements as
>>> >>> >> > there are verbs which link it to an Object.
>>> >>> >>
>>> >>> >> As explained above you would have a fise:OccurrentAnnotation that
>>> >>> >> represents the Perdurant. The information that the activity mentioned
>>> >>> >> in the text is "invades" would be expressed by linking to a
>>> >>> >> fise:TextAnnotation. If you can also provide an Ontology for Tasks
>>> >>> >> that defines "myTasks:invade", the fise:OccurrentAnnotation could
>>> >>> >> also link to a fise:EntityAnnotation for this concept.
>>> >>> >>
>>> >>> >> best
>>> >>> >> Rupert
>>> >>> >>
>>> >>> >> >
>>> >>> >> >> ### Consuming the data:
>>> >>> >> >>
>>> >>> >> >> I think this model should be sufficient for use-cases as described
>>> >>> >> >> by you.
>>> >>> >> >>
>>> >>> >> >> Users would be able to consume data on the setting level. This can
>>> >>> >> >> be done by simply retrieving all fise:ParticipantAnnotation as well
>>> >>> >> >> as fise:OccurrentAnnotation linked with a setting. BTW this was the
>>> >>> >> >> approach used in LIVE [2] for semantic search. It allows queries
>>> >>> >> >> for Settings that involve specific Entities, e.g. you could filter
>>> >>> >> >> for Settings that involve a {Person}, activities:Arrested and a
>>> >>> >> >> specific {Uprising}. However note that with this approach you will
>>> >>> >> >> also get results for Settings where the {Person} participated and
>>> >>> >> >> another person was arrested.
>>> >>> >> >>
>>> >>> >> >> Another possibility would be to process enhancement results on the
>>> >>> >> >> fise:OccurrentAnnotation level. This would allow a much higher
>>> >>> >> >> level of granularity (e.g. it would allow to correctly answer the
>>> >>> >> >> query used as an example above). But I am wondering if the quality
>>> >>> >> >> of the Setting extraction will be sufficient for this. I also have
>>> >>> >> >> doubts whether this can still be realized by using semantic
>>> >>> >> >> indexing to Apache Solr or if it would be better/necessary to store
>>> >>> >> >> results in a TripleStore and use SPARQL for retrieval.
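
The difference between the two query levels can be illustrated with a small in-memory sketch. The dictionary layout, the person:/activities: identifiers and both function names are hypothetical. The sketch only assumes that each occurrent lists its own participants, as the dc:has-participant link described above would allow.

```python
# One setting in which person:A took part, but person:B was the one arrested.
setting = {
    "participants": {"person:A", "person:B"},
    "occurrents": [
        {"type": "activities:Arrested", "participants": {"person:B"}},
        {"type": "activities:Protested", "participants": {"person:A"}},
    ],
}


def matches_setting_level(setting, person, activity):
    """True if the person and the activity merely co-occur in the setting."""
    has_person = person in setting["participants"]
    has_activity = any(o["type"] == activity for o in setting["occurrents"])
    return has_person and has_activity


def matches_occurrent_level(setting, person, activity):
    """True only if the person participates in an occurrent of that type."""
    return any(
        o["type"] == activity and person in o["participants"]
        for o in setting["occurrents"]
    )


# Setting level reports a hit although person:A was not arrested.
print(matches_setting_level(setting, "person:A", "activities:Arrested"))
# Occurrent level follows the participant links and rejects it.
print(matches_occurrent_level(setting, "person:A", "activities:Arrested"))
```

The setting-level check is cheaper and maps naturally onto a flat index, while the occurrent-level check needs the links between annotations, which is where a graph store may fit better.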
>>> >>> >> >>
>>> >>> >> >> The methodology and query language used by YAGO [3] is also very
>>> >>> >> >> relevant for this (especially note chapter 7 SPOTL(X)
>>> >>> >> >> Representation).
>>> >>> >> >>
>>> >>> >> >> Another related topic is the enrichment of Entities (especially
>>> >>> >> >> Events) in knowledge bases based on Settings extracted from
>>> >>> >> >> Documents. As per definition - in DOLCE - Perdurants are temporally
>>> >>> >> >> indexed. That means that at the time when added to a knowledge base
>>> >>> >> >> they might still be in process. So the creation, enriching and
>>> >>> >> >> refinement of such Entities in the knowledge base seems to be
>>> >>> >> >> critical for a System like the one described in your use-case.
>>> >>> >> >>
>>> >>> >> >> On Tue, Jun 11, 2013 at 9:09 PM, Cristian Petroaca
>>> >>> >> >> <[email protected]> wrote:
>>> >>> >> >> >
>>> >>> >> >> > First of all I have to mention that I am new in the field of
>>> >>> >> >> > semantic technologies, I've started to read about them in the
>>> >>> >> >> > last 4-5 months. Having said that I have a high level overview
>>> >>> >> >> > of what is a good approach to solve this problem. There are a
>>> >>> >> >> > number of papers on the internet which describe what steps need
>>> >>> >> >> > to be taken such as: named entity recognition, co-reference
>>> >>> >> >> > resolution, pos tagging and others.
>>> >>> >> >>
>>> >>> >> >> The Stanbol NLP processing module currently only supports sentence
>>> >>> >> >> detection, tokenization, POS tagging, Chunking, NER and lemma.
>>> >>> >> >> Support for co-reference resolution and dependency trees is
>>> >>> >> >> currently missing.
>>> >>> >> >>
>>> >>> >> >> Stanford NLP is already integrated with Stanbol [4]. At the moment
>>> >>> >> >> it only supports English, but I am already working on including the
>>> >>> >> >> other supported languages. Other NLP frameworks that are already
>>> >>> >> >> integrated with Stanbol are Freeling [5] and Talismane [6]. But
>>> >>> >> >> note that for all those the integration excludes support for
>>> >>> >> >> co-reference and dependency trees.
>>> >>> >> >>
>>> >>> >> >> Anyways I am confident that one can implement a first prototype by
>>> >>> >> >> only using Sentences and POS tags and - if available - Chunks (e.g.
>>> >>> >> >> Noun phrases).
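
As a rough illustration of what a prototype working only on sentences and POS tags could look like, here is a deliberately naive heuristic. It is not Stanbol code: the function name is hypothetical, the input is assumed to be one sentence as (word, tag) pairs, and the tags are assumed to follow the Penn Treebank convention (NN* for nouns, VB* for verbs).

```python
def naive_svo(tagged):
    """Guess (agent, verb, patient) from one POS-tagged sentence, else None.

    Heuristic: take the first verb, the closest noun before it as the agent
    and the first noun after it as the patient.
    """
    verb_idx = next(
        (i for i, (_, tag) in enumerate(tagged) if tag.startswith("VB")), None
    )
    if verb_idx is None:
        return None
    nouns_before = [w for w, t in tagged[:verb_idx] if t.startswith("NN")]
    nouns_after = [w for w, t in tagged[verb_idx + 1:] if t.startswith("NN")]
    if not nouns_before or not nouns_after:
        return None
    return nouns_before[-1], tagged[verb_idx][0], nouns_after[0]


sentence = [("USA", "NNP"), ("invades", "VBZ"), ("Irak", "NNP")]
print(naive_svo(sentence))
```

A heuristic like this breaks on passive voice, pronouns and multi-verb sentences, which is exactly where dependency trees and co-reference resolution would be needed.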
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> > I assume that in the Stanbol context, a feature like Relation
>>> >>> >> > extraction would be implemented as an EnhancementEngine?
>>> >>> >> > What kind of effort would be required for a co-reference resolution
>>> >>> >> > tool integration into Stanbol?
>>> >>> >> >
>>> >>> >>
>>> >>> >> Yes in the end it would be an EnhancementEngine. But before we can
>>> >>> >> build such an engine we would need to
>>> >>> >>
>>> >>> >> * extend the Stanbol NLP processing API with Annotations for
>>> >>> >> co-reference
>>> >>> >> * add support for JSON Serialisation/Parsing for those annotations so
>>> >>> >> that the RESTful NLP Analysis Service can provide co-reference
>>> >>> >> information
>>> >>> >>
>>> >>> >> > At this moment I'll be focusing on 2 aspects:
>>> >>> >> >
>>> >>> >> > 1. Determine the best data structure to encapsulate the extracted
>>> >>> >> > information. I'll take a closer look at Dolce.
>>> >>> >>
>>> >>> >> Don't make it too complex. Defining a proper structure to represent
>>> >>> >> Events will only pay off if we can also successfully extract such
>>> >>> >> information from processed texts.
>>> >>> >>
>>> >>> >> I would start with
>>> >>> >>
>>> >>> >>  * fise:SettingAnnotation
>>> >>> >>     * {fise:Enhancement} metadata
>>> >>> >>
>>> >>> >>  * fise:ParticipantAnnotation
>>> >>> >>     * {fise:Enhancement} metadata
>>> >>> >>     * fise:inSetting {settingAnnotation}
>>> >>> >>     * fise:hasMention {textAnnotation}
>>> >>> >>     * fise:suggestion {entityAnnotation} (multiple if there are more
>>> >>> >> suggestions)
>>> >>> >>     * dc:type one of fise:Agent, fise:Patient, fise:Instrument,
>>> >>> >> fise:Cause
>>> >>> >>
>>> >>> >>  * fise:OccurrentAnnotation
>>> >>> >>     * {fise:Enhancement} metadata
>>> >>> >>     * fise:inSetting {settingAnnotation}
>>> >>> >>     * fise:hasMention {textAnnotation}
>>> >>> >>     * dc:type set to fise:Activity
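
The starting structure listed above could be mirrored in code roughly as follows. This is only a sketch: the class and field names are hypothetical Python renderings of the listed properties, the generic {fise:Enhancement} metadata is reduced to a single uri field, and mentions and suggestions are represented by plain URI strings.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Role(Enum):
    """The dc:type values proposed for participants."""
    AGENT = "fise:Agent"
    PATIENT = "fise:Patient"
    INSTRUMENT = "fise:Instrument"
    CAUSE = "fise:Cause"


@dataclass
class SettingAnnotation:
    uri: str  # stands in for the {fise:Enhancement} metadata


@dataclass
class ParticipantAnnotation:
    uri: str
    in_setting: SettingAnnotation          # fise:inSetting
    has_mention: List[str]                 # fise:hasMention (text annotations)
    suggestions: List[str] = field(default_factory=list)  # fise:suggestion
    role: Optional[Role] = None            # dc:type, if it can be extracted


@dataclass
class OccurrentAnnotation:
    uri: str
    in_setting: SettingAnnotation          # fise:inSetting
    has_mention: List[str]                 # fise:hasMention (typically verbs)
    dc_type: str = "fise:Activity"         # dc:type


setting = SettingAnnotation("urn:example:setting")
agent = ParticipantAnnotation(
    "urn:example:participant-usa", setting, ["urn:example:text-usa"], role=Role.AGENT
)
activity = OccurrentAnnotation(
    "urn:example:occurrent-invades", setting, ["urn:example:text-invades"]
)
print(agent.role.value, activity.dc_type)
```

Keeping the role optional reflects the doubt raised earlier in the thread about whether roles can be extracted reliably.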
>>> >>> >>
>>> >>> >> If it turns out that we can extract more, we can add more structure to
>>> >>> >> those annotations. We might also think about using an own namespace
>>> >>> >> for those extensions to the annotation structure.
>>> >>> >>
>>> >>> >> > 2. Determine how should all of this be integrated into Stanbol.
>>> >>> >>
>>> >>> >> Just create an EventExtractionEngine and configure an enhancement
>>> >>> >> chain that does NLP processing and EntityLinking.
>>> >>> >>
>>> >>> >> You should have a look at
>>> >>> >>
>>> >>> >> * SentimentSummarizationEngine [1] as it does a lot of things with NLP
>>> >>> >> processing results (e.g. connecting adjectives (via verbs) to
>>> >>> >> nouns/pronouns). So as long as we cannot use explicit dependency
>>> >>> >> trees your code will need to do similar things with Nouns, Pronouns
>>> >>> >> and Verbs.
>>> >>> >>
>>> >>> >> * Disambiguation-MLT engine, as it creates a Java representation of
>>> >>> >> present fise:TextAnnotation and fise:EntityAnnotation [2]. Something
>>> >>> >> similar will also be required by the EventExtractionEngine for fast
>>> >>> >> access to such annotations while iterating over the Sentences of the
>>> >>> >> text.
>>> >>> >>
>>> >>> >>
>>> >>> >> best
>>> >>> >> Rupert
>>> >>> >>
>>> >>> >> [1] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/sentiment-summarization/src/main/java/org/apache/stanbol/enhancer/engines/sentiment/summarize/SentimentSummarizationEngine.java
>>> >>> >> [2] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/disambiguation-mlt/src/main/java/org/apache/stanbol/enhancer/engine/disambiguation/mlt/DisambiguationData.java
>>> >>> >>
>>> >>> >> >
>>> >>> >> > Thanks
>>> >>> >> >
>>> >>> >> >> Hope this helps to bootstrap this discussion
>>> >>> >> >> best
>>> >>> >> >> Rupert
>>> >>> >> >>
>>> >>> >> >> --
>>> >>> >> >> | Rupert Westenthaler             [email protected]
>>> >>> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
>>> >>> >> >> | A-5500 Bischofshofen
>>> >>> >> >>
>>> >>> >>
>>> >>> >>
>>> >>> >>
>>> >>> >> --
>>> >>> >> | Rupert Westenthaler             [email protected]
>>> >>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>>> >>> >> | A-5500 Bischofshofen
>>> >>> >>
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> | Rupert Westenthaler             [email protected]
>>> >>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> >>> | A-5500 Bischofshofen
>>> >>>
>>> >>
>>> >>
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler             [email protected]
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
