Re: Relation extraction feature

Cristian Petroaca Thu, 27 Jun 2013 08:33:34 -0700

I went through the OpenNLP Parser documentation. Aside from POS tagging, it
also groups the different POS elements together via parentheses. In their
example, the subject and the verb are grouped together and the object is left
separate. This may be of some use, but it's not real dependency parsing.
What deters me from using this parser are the phrases "The tool is only
intended for demonstration and testing" and "Right now the tree insert
parser is still experimental".



2013/6/27 Rafa Haro <[email protected]>

> Hi Cristian and Rupert,
>
> On 27/06/13 15:25, Rupert Westenthaler wrote:
>
>> On Thu, Jun 27, 2013 at 3:12 PM, Cristian Petroaca
>> <[email protected]> wrote:
>>
>>> Sorry, I meant the Stanbol NLP API, not Stanford in my previous e-mail.
>>> By the way, does OpenNLP have the ability to build dependency trees?
>>>
>> AFAIK OpenNLP does not provide this feature.
>>
> I haven't been able to find an exact answer to that because, on the one hand,
> according to the OpenNLP documentation, it provides an English parser that
> seems to go beyond shallow parsing or chunking but, on the other hand,
> it does not seem to be an actual dependency parser like the one you can
> find in Stanford CoreNLP.
>
>
>>  2013/6/23 Cristian Petroaca <[email protected]>
>>>
>>>  Hi Rupert,
>>>>
>>>> I created JIRA issue https://issues.apache.org/jira/browse/STANBOL-1121.
>>>> As you suggested I would start with extending the Stanford NLP with
>>>> co-reference resolution but I think also with dependency trees because I
>>>> also need to know the Subject of the sentence and the object that it
>>>> affects, right?
>>>>
>>>> Given that I need to extend the Stanford NLP API in Stanbol for
>>>> co-reference and dependency trees, how do I proceed with this? Do I
>>>> create 2 new sub-tasks under the already opened JIRA issue? After that,
>>>> can I start implementing on my local copy of Stanbol and, when I'm done,
>>>> send you guys the patch for review?
>>>>
>> I would create two "New Feature" type issues, one for adding support
>> for "dependency trees" and the other for "co-reference" support. You
>> should also define "depends on" relations between STANBOL-1121 and
>> those two new issues.
>>
>> Sub-tasks could also work, but as adding those features would also be
>> interesting for other things, I would rather define them as separate
>> issues.
>>
>> If you would prefer to work in your own branch, please tell me. This
>> could have the advantage that patches would not be affected by changes
>> in the trunk.
>>
>> best
>> Rupert
>>
> Regards
>
>>>> Regards,
>>>> Cristian
>>>>
>>>>
>>>> 2013/6/18 Rupert Westenthaler <[email protected]>
>>>>
>>>>  On Mon, Jun 17, 2013 at 10:18 PM, Cristian Petroaca
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Rupert,
>>>>>>
>>>>>> Agreed on the SettingAnnotation/ParticipantAnnotation/OccurrentAnnotation
>>>>>> data structure.
>>>>>>
>>>>>> Should I open up a JIRA issue for all of this in order to encapsulate this
>>>>>> information and establish the goals and these initial steps towards these
>>>>>> goals?
>>>>>>
>>>>> Yes please. A JIRA issue for this work would be great.
>>>>>
>>>>>> How should I proceed further? Should I create some design documents that
>>>>>> need to be reviewed?
>>>>>>
>>>>> Usually it is best to write design-related text directly in JIRA
>>>>> using Markdown [1] syntax. This will allow us to later use this
>>>>> text directly for the documentation on the Stanbol webpage.
>>>>>
>>>>> best
>>>>> Rupert
>>>>>
>>>>>
>>>>> [1] http://daringfireball.net/projects/markdown/
>>>>>
>>>>>> Regards,
>>>>>> Cristian
>>>>>>
>>>>>>
>>>>>> 2013/6/17 Rupert Westenthaler <[email protected]>
>>>>>>
>>>>>>  On Thu, Jun 13, 2013 at 8:22 PM, Cristian Petroaca
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Rupert,
>>>>>>>>
>>>>>>>> First of all thanks for the detailed suggestions.
>>>>>>>>
>>>>>>>> 2013/6/12 Rupert Westenthaler <[email protected]>
>>>>>>>>
>>>>>>>>> Hi Cristian, all
>>>>>>>>>
>>>>>>>>> really interesting use case!
>>>>>>>>>
>>>>>>>>> In this mail I will try to give some suggestions on how this could
>>>>>>>>> work out. These suggestions are mainly based on experiences and lessons
>>>>>>>>> learned in the LIVE [2] project, where we built an information system
>>>>>>>>> for the Olympic Games in Peking. While this project excluded the
>>>>>>>>> extraction of Events from unstructured text (because the Olympic
>>>>>>>>> Information System was already providing event data as XML messages),
>>>>>>>>> the semantic search capabilities of this system were very similar to
>>>>>>>>> the ones described by your use case.
>>>>>>>>>
>>>>>>>>> IMHO you are not only trying to extract relations, but a formal
>>>>>>>>> representation of the situation described by the text. So let's assume
>>>>>>>>> that the goal is to annotate a Setting (or Situation) described in the
>>>>>>>>> text - a fise:SettingAnnotation.
>>>>>>>>>
>>>>>>>>> The DOLCE foundational ontology [1] gives some advice on how to model
>>>>>>>>> those. The important relation for modeling this is Participation:
>>>>>>>>>
>>>>>>>>>      PC(x, y, t) → (ED(x) ∧ PD(y) ∧ T(t))
>>>>>>>>>
>>>>>>>>> where ..
>>>>>>>>>
>>>>>>>>>   * ED are Endurants (continuants): Endurants do have an identity, so we
>>>>>>>>> would typically refer to them as Entities referenced by a setting.
>>>>>>>>> Note that this includes physical, non-physical as well as
>>>>>>>>> social objects.
>>>>>>>>>   * PD are Perdurants (occurrents): Perdurants are entities that
>>>>>>>>> happen in time. This refers to Events, Activities ...
>>>>>>>>>   * PC is Participation: a time-indexed relation where
>>>>>>>>> Endurants participate in Perdurants
>>>>>>>>>
>>>>>>>>> Modeling this in RDF requires defining some intermediate resources
>>>>>>>>> because RDF does not allow for n-ary relations.
>>>>>>>>>
>>>>>>>>>   * fise:SettingAnnotation: It is really handy to define one resource
>>>>>>>>> being the context for all described data. I would call this
>>>>>>>>> "fise:SettingAnnotation" and define it as a sub-concept of
>>>>>>>>> fise:Enhancement. All further enhancements about the extracted Setting
>>>>>>>>> would define a "fise:in-setting" relation to it.
>>>>>>>>>
>>>>>>>>>   * fise:ParticipantAnnotation: Is used to annotate that an Endurant is
>>>>>>>>> participating in a setting (fise:in-setting fise:SettingAnnotation).
>>>>>>>>> The Endurant itself is described by existing fise:TextAnnotation (the
>>>>>>>>> mentions) and fise:EntityAnnotation (suggested Entities). Basically
>>>>>>>>> the fise:ParticipantAnnotation will allow an EnhancementEngine to
>>>>>>>>> state that several mentions (possibly in different sentences)
>>>>>>>>> represent the same Endurant as participating in the Setting. In
>>>>>>>>> addition it would be possible to use the dc:type property (similar to
>>>>>>>>> fise:TextAnnotation) to refer to the role(s) of a participant
>>>>>>>>> (e.g. the set: Agent (intentionally performs an action), Cause
>>>>>>>>> (unintentionally, e.g. a mud slide), Patient (a passive role in an
>>>>>>>>> activity) and Instrument (aids a process)), but I am wondering if one
>>>>>>>>> could extract this information.
>>>>>>>>>
>>>>>>>>>   * fise:OccurrentAnnotation: Is used to annotate a Perdurant in the
>>>>>>>>> context of the Setting. A fise:OccurrentAnnotation can also link to
>>>>>>>>> fise:TextAnnotation (typically verbs in the text defining the
>>>>>>>>> Perdurant) as well as fise:EntityAnnotation suggesting well-known
>>>>>>>>> Events in a knowledge base (e.g. an election in a country, or an
>>>>>>>>> uprising ...). In addition fise:OccurrentAnnotation can define
>>>>>>>>> dc:has-participant links to fise:ParticipantAnnotation. In this case
>>>>>>>>> it is explicitly stated that an Endurant (the
>>>>>>>>> fise:ParticipantAnnotation) is involved in this Perdurant (the
>>>>>>>>> fise:OccurrentAnnotation). As Occurrences are temporally indexed, this
>>>>>>>>> annotation should also support properties for defining the
>>>>>>>>> xsd:dateTime for the start/end.
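
The point about n-ary relations can be illustrated with a minimal sketch.
Plain Python tuples stand in for RDF triples here; the "ex:" property names,
the node identifiers and the timestamp are placeholders invented for this
illustration and are not part of any Stanbol or DOLCE vocabulary. The ternary
relation PC(x, y, t) is expressed through one intermediate node that carries
three binary links.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def reify_participation(endurant: str, perdurant: str,
                        start: str, node_id: str) -> List[Triple]:
    """Express PC(endurant, perdurant, t) as binary triples.

    RDF statements are binary, so the three-place relation is attached
    to an intermediate node (node_id). All "ex:" names are hypothetical.
    """
    return [
        (node_id, "rdf:type", "ex:Participation"),
        (node_id, "ex:participant", endurant),   # the Endurant x
        (node_id, "ex:occurrent", perdurant),    # the Perdurant y
        (node_id, "ex:start", start),            # the time index t
    ]


if __name__ == "__main__":
    for triple in reify_participation("ex:endurant-1", "ex:perdurant-1",
                                      "2013-06-12T00:00:00",
                                      "ex:participation-1"):
        print(triple)
```

The annotation types proposed above play the role of such intermediate
nodes.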
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Indeed, an event-based data structure makes a lot of sense, with the
>>>>>>>> remark that you probably won't always be able to extract the date for a
>>>>>>>> given setting (situation).
>>>>>>>> There are 2 things which are unclear, though.
>>>>>>>>
>>>>>>>> 1. Perdurant: You could have situations in which the object upon which
>>>>>>>> the Subject (or Endurant) is acting is not a transitory object (such as
>>>>>>>> an event or activity) but rather another Endurant. For example, we can
>>>>>>>> have the phrase "USA invades Irak" where "USA" is the Endurant (Subject)
>>>>>>>> which performs the action of "invading" on another Endurant, namely
>>>>>>>> "Irak".
>>>>>>>>
>>>>>
>>>>>>> By using CAOS, USA would be the Agent and Iraq the Patient. Both are
>>>>>>> Endurants. The activity "invading" would be the Perdurant. So ideally
>>>>>>> you would have a "fise:SettingAnnotation" with:
>>>>>>>
>>>>>>>    * fise:ParticipantAnnotation for USA with the dc:type caos:Agent,
>>>>>>> linking to a fise:TextAnnotation for "USA" and a fise:EntityAnnotation
>>>>>>> linking to dbpedia:United_States
>>>>>>>    * fise:ParticipantAnnotation for Iraq with the dc:type caos:Patient,
>>>>>>> linking to a fise:TextAnnotation for "Irak" and a
>>>>>>> fise:EntityAnnotation linking to dbpedia:Iraq
>>>>>>>    * fise:OccurrentAnnotation for "invades" with the dc:type
>>>>>>> caos:Activity, linking to a fise:TextAnnotation for "invades"
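
The "USA invades Irak" breakdown above could be sketched as plain data
classes. This is only an in-memory illustration, not Stanbol code: the class
and field names mirror the annotation names proposed in this thread, while the
setting identifier and the helper build_example() are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SettingAnnotation:
    """Context resource grouping everything extracted for one situation."""
    id: str


@dataclass
class ParticipantAnnotation:
    setting: SettingAnnotation          # the setting this participant is in
    mention: str                        # text covered by the TextAnnotation
    role: str                           # dc:type, e.g. "Agent" or "Patient"
    suggestion: Optional[str] = None    # entity from an EntityAnnotation


@dataclass
class OccurrentAnnotation:
    setting: SettingAnnotation
    mention: str                        # typically the verb in the text
    type: str = "Activity"              # dc:type
    participants: List[ParticipantAnnotation] = field(default_factory=list)


def build_example():
    """Build the annotations for the sentence "USA invades Irak"."""
    setting = SettingAnnotation("urn:example:setting-1")  # hypothetical id
    usa = ParticipantAnnotation(setting, "USA", "Agent",
                                "dbpedia:United_States")
    iraq = ParticipantAnnotation(setting, "Irak", "Patient", "dbpedia:Iraq")
    invades = OccurrentAnnotation(setting, "invades",
                                  participants=[usa, iraq])
    return setting, invades


if __name__ == "__main__":
    _, occurrent = build_example()
    for p in occurrent.participants:
        print(p.role, p.mention, p.suggestion)
```

Both Endurants hang off the same setting, and the verb is carried by the
occurrent rather than by a property on the subject.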
>>>>>>>
>>>>>>>> 2. Where does the verb, which links the Subject and the Object, come into
>>>>>>>> this? I imagined that the Endurant would have a dc:"property" where the
>>>>>>>> property = verb which links to the Object in noun form. For example, take
>>>>>>>> again the sentence "USA invades Irak". You would have the "USA" Entity
>>>>>>>> with dc:invader which points to the Object "Irak". The Endurant would
>>>>>>>> have as many dc:"property" elements as there are verbs which link it to
>>>>>>>> an Object.
>>>>>>>>
>>>>>>>
>>>>>>> As explained above you would have a fise:OccurrentAnnotation that
>>>>>>> represents the Perdurant. The information that the activity mentioned in
>>>>>>> the text is "invades" would be provided by linking to a
>>>>>>> fise:TextAnnotation. If you can also provide an Ontology for Tasks that
>>>>>>> defines "myTasks:invade", the fise:OccurrentAnnotation could also link
>>>>>>> to a fise:EntityAnnotation for this concept.
>>>>>>>
>>>>>>> best
>>>>>>> Rupert
>>>>>>>
>>>>>>>>> ### Consuming the data:
>>>>>>>>>
>>>>>>>>> I think this model should be sufficient for use-cases as described by
>>>>>>>>> you.
>>>>>>>>>
>>>>>>>>> Users would be able to consume data on the setting level. This can be
>>>>>>>>> done by simply retrieving all fise:ParticipantAnnotation as well as
>>>>>>>>> fise:OccurrentAnnotation linked with a setting. BTW this was the
>>>>>>>>> approach used in LIVE [2] for semantic search. It allows queries for
>>>>>>>>> Settings that involve specific Entities, e.g. you could filter for
>>>>>>>>> Settings that involve a {Person}, activities:Arrested and a specific
>>>>>>>>> {Uprising}. However, note that with this approach you will also get
>>>>>>>>> results for Settings where the {Person} participated and another person
>>>>>>>>> was arrested.
>>>>>>>>>
>>>>>>>>> Another possibility would be to process enhancement results on the
>>>>>>>>> fise:OccurrentAnnotation. This would allow for a much higher
>>>>>>>>> granularity level (e.g. it would allow to correctly answer the query
>>>>>>>>> used as an example above). But I am wondering if the quality of the
>>>>>>>>> Setting extraction will be sufficient for this. I also have doubts
>>>>>>>>> whether this can still be realized by using semantic indexing to Apache
>>>>>>>>> Solr, or if it would be better/necessary to store results in a
>>>>>>>>> TripleStore and use SPARQL for retrieval.
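
The difference between the two retrieval granularities can be shown with a
toy in-memory index. This is a rough sketch only: the SETTINGS data, the
identifiers and both function names are invented for illustration, and a
Python dict stands in for whatever Solr index or triple store would be used.

```python
from typing import Dict, List, Set

# setting id -> activity -> entities participating in that activity.
# All identifiers are made up for this example.
SETTINGS: Dict[str, Dict[str, Set[str]]] = {
    "setting-1": {
        "activities:Arrested": {"person:A"},
        "activities:Protested": {"person:A", "event:Uprising"},
    },
    "setting-2": {
        "activities:Arrested": {"person:B"},
        "activities:Protested": {"person:A", "event:Uprising"},
    },
}


def setting_level(person: str, activity: str, event: str) -> List[str]:
    """Match settings that merely contain all three items somewhere."""
    hits = []
    for sid, occurrents in SETTINGS.items():
        entities = set().union(*occurrents.values())
        if person in entities and event in entities and activity in occurrents:
            hits.append(sid)
    return hits


def occurrent_level(person: str, activity: str, event: str) -> List[str]:
    """Require that the person participates in the given activity itself."""
    hits = []
    for sid, occurrents in SETTINGS.items():
        entities = set().union(*occurrents.values())
        if person in occurrents.get(activity, set()) and event in entities:
            hits.append(sid)
    return hits


if __name__ == "__main__":
    query = ("person:A", "activities:Arrested", "event:Uprising")
    print(setting_level(*query))    # also returns setting-2 (person:B arrested)
    print(occurrent_level(*query))  # only setting-1
```

The setting-level query returns setting-2 even though somebody else was
arrested there, which is exactly the imprecision described above.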
>>>>>>>>>
>>>>>>>>> The methodology and query language used by YAGO [3] is also very
>>>>>>>>> relevant for this (especially note chapter 7 SPOTL(X) Representation).
>>>>>>>>>
>>>>>>>>> Another related topic is the enrichment of Entities (especially
>>>>>>>>> Events) in knowledge bases based on Settings extracted from Documents.
>>>>>>>>> As per definition - in DOLCE - Perdurants are temporally indexed. That
>>>>>>>>> means that at the time when added to a knowledge base they might still
>>>>>>>>> be in process. So the creation, enriching and refinement of such
>>>>>>>>> Entities in the knowledge base seems to be critical for a system
>>>>>>>>> like the one described in your use-case.
>>>>>>>>>
>>>>>>>>> On Tue, Jun 11, 2013 at 9:09 PM, Cristian Petroaca
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> First of all I have to mention that I am new in the field of semantic
>>>>>>>>>> technologies; I started to read about them in the last 4-5 months.
>>>>>>>>>> Having said that, I have a high-level overview of what is a good
>>>>>>>>>> approach to solve this problem. There are a number of papers on the
>>>>>>>>>> internet which describe what steps need to be taken, such as named
>>>>>>>>>> entity recognition, co-reference resolution, POS tagging and others.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> The Stanbol NLP processing module currently only supports sentence
>>>>>>>>> detection, tokenization, POS tagging, chunking, NER and lemmas.
>>>>>>>>> Support for co-reference resolution and dependency trees is currently
>>>>>>>>> missing.
>>>>>>>>>
>>>>>>>>> Stanford NLP is already integrated with Stanbol [4]. At the moment it
>>>>>>>>> only supports English, but I am already working on including the other
>>>>>>>>> supported languages. Other NLP frameworks that are already integrated
>>>>>>>>> with Stanbol are Freeling [5] and Talismane [6]. But note that for all
>>>>>>>>> of those the integration excludes support for co-reference and
>>>>>>>>> dependency trees.
>>>>>>>>>
>>>>>>>>> Anyway, I am confident that one can implement a first prototype by
>>>>>>>>> only using Sentences and POS tags and - if available - Chunks (e.g.
>>>>>>>>> noun phrases).
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I assume that in the Stanbol context, a feature like Relation extraction
>>>>>>>> would be implemented as an EnhancementEngine?
>>>>>>>> What kind of effort would be required for integrating a co-reference
>>>>>>>> resolution tool into Stanbol?
>>>>>>>>
>>>>>>> Yes, in the end it would be an EnhancementEngine. But before we can
>>>>>>> build such an engine we would need to
>>>>>>>
>>>>>>> * extend the Stanbol NLP processing API with Annotations for co-reference
>>>>>>> * add support for JSON serialisation/parsing for those annotations so
>>>>>>> that the RESTful NLP Analysis Service can provide co-reference
>>>>>>> information
>>>>>>>
>>>>>>>> At this moment I'll be focusing on 2 aspects:
>>>>>>>>
>>>>>>>> 1. Determine the best data structure to encapsulate the extracted
>>>>>>>> information. I'll take a closer look at DOLCE.
>>>>>>>>
>>>>>>> Don't make it too complex. Defining a proper structure to represent
>>>>>>> Events will only pay off if we can also successfully extract such
>>>>>>> information from processed texts.
>>>>>>>
>>>>>>> I would start with
>>>>>>>
>>>>>>>   * fise:SettingAnnotation
>>>>>>>      * {fise:Enhancement} metadata
>>>>>>>
>>>>>>>   * fise:ParticipantAnnotation
>>>>>>>      * {fise:Enhancement} metadata
>>>>>>>      * fise:inSetting {settingAnnotation}
>>>>>>>      * fise:hasMention {textAnnotation}
>>>>>>>      * fise:suggestion {entityAnnotation} (multiple if there are more
>>>>>>> suggestions)
>>>>>>>      * dc:type one of fise:Agent, fise:Patient, fise:Instrument,
>>>>>>> fise:Cause
>>>>>>>
>>>>>>>   * fise:OccurrentAnnotation
>>>>>>>      * {fise:Enhancement} metadata
>>>>>>>      * fise:inSetting {settingAnnotation}
>>>>>>>      * fise:hasMention {textAnnotation}
>>>>>>>      * dc:type set to fise:Activity
>>>>>>>
>>>>>>> If it turns out that we can extract more, we can add more structure to
>>>>>>> those annotations. We might also think about using a separate namespace
>>>>>>> for those extensions to the annotation structure.
>>>>>>>
>>>>>>>> 2. Determine how all of this should be integrated into Stanbol.
>>>>>>>>
>>>>>>> Just create an EventExtractionEngine and configure an enhancement chain
>>>>>>> that does NLP processing and EntityLinking.
>>>>>>>
>>>>>>> You should have a look at
>>>>>>>
>>>>>>> * SentimentSummarizationEngine [1], as it does a lot of things with NLP
>>>>>>> processing results (e.g. connecting adjectives (via verbs) to
>>>>>>> nouns/pronouns). So as long as we cannot use explicit dependency trees,
>>>>>>> your code will need to do similar things with Nouns, Pronouns and
>>>>>>> Verbs.
>>>>>>>
>>>>>>> * Disambiguation-MLT engine, as it creates a Java representation of
>>>>>>> present fise:TextAnnotation and fise:EntityAnnotation [2]. Something
>>>>>>> similar will also be required by the EventExtractionEngine for fast
>>>>>>> access to such annotations while iterating over the Sentences of the
>>>>>>> text.
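
Without dependency trees, a first prototype has to guess subject and object
from POS tags alone. The following is a deliberately naive sketch of that
idea, not the logic of SentimentSummarizationEngine: the coarse tag names
("VERB", "NOUN", "PROPN", "PRON") and the function extract_triple() are
assumptions made for this illustration. It takes the first verb of a sentence
and the nearest nominal token on each side of it.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, coarse POS tag); the tag set is assumed

NOMINAL = {"NOUN", "PROPN", "PRON"}


def extract_triple(sentence: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return (subject, verb, object) around the first verb, or None.

    Heuristic: the nearest nominal token before the verb is taken as the
    subject, the nearest one after it as the object.
    """
    verb_idx = next(
        (i for i, (_, pos) in enumerate(sentence) if pos == "VERB"), None)
    if verb_idx is None:
        return None
    subject = next(
        (w for w, pos in reversed(sentence[:verb_idx]) if pos in NOMINAL),
        None)
    obj = next(
        (w for w, pos in sentence[verb_idx + 1:] if pos in NOMINAL), None)
    if subject is None or obj is None:
        return None
    return subject, sentence[verb_idx][0], obj


if __name__ == "__main__":
    tagged = [("USA", "PROPN"), ("invades", "VERB"), ("Irak", "PROPN")]
    print(extract_triple(tagged))
```

A heuristic like this breaks on passive voice and on sentences with several
clauses, which is where co-reference and dependency information would be
needed.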
>>>>>>>
>>>>>>>
>>>>>>> best
>>>>>>> Rupert
>>>>>>>
>>>>>>> [1] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/sentiment-summarization/src/main/java/org/apache/stanbol/enhancer/engines/sentiment/summarize/SentimentSummarizationEngine.java
>>>>>>>
>>>>>>> [2] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/disambiguation-mlt/src/main/java/org/apache/stanbol/enhancer/engine/disambiguation/mlt/DisambiguationData.java
>>>>>
>>>>>> Thanks
>>>>>>>>
>>>>>>>> Hope this helps to bootstrap this discussion
>>>>>>>>
>>>>>>>>> best
>>>>>>>>> Rupert
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> | Rupert Westenthaler             [email protected]
>>>>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>
>
