Hi Cristian and Rupert,

On 27/06/13 15:25, Rupert Westenthaler wrote:
On Thu, Jun 27, 2013 at 3:12 PM, Cristian Petroaca
<cristian.petro...@gmail.com> wrote:
Sorry, I meant the Stanbol NLP API, not Stanford, in my previous e-mail. By
the way, does OpenNLP have the ability to build dependency trees?

AFAIK OpenNLP does not provide this feature.
I haven't been able to find an exact answer to that. On the one hand, according to the OpenNLP documentation, it provides an English parser that seems to go beyond shallow parsing or chunking. On the other hand, it does not seem to be an actual dependency parser like the one you can find, for example, in Stanford CoreNLP.

2013/6/23 Cristian Petroaca <cristian.petro...@gmail.com>

Hi Rupert,

I created jira https://issues.apache.org/jira/browse/STANBOL-1121.
As you suggested I would start with extending the Stanford NLP with
co-reference resolution but I think also with dependency trees because I
also need to know the Subject of the sentence and the object that it
affects, right?

Given that I need to extend the Stanford NLP API in Stanbol for
co-reference and dependency trees, how do I proceed with this? Do I create
2 new sub-tasks to the already opened Jira? After that can I start
implementing on my local copy of Stanbol and when I'm done I'll send you
guys the patch for review?

I would create two "New Feature" type Issues one for adding support
for "dependency trees" and the other for "co-reference" support. You
should also define "depends on" relations between STANBOL-1121 and
those two new issues.

Sub-task could also work, but as adding those features would be also
interesting for other things I would rather define them as separate
issues.

If you would prefer to work in your own branch, please tell me. This
could have the advantage that patches would not be affected by changes
in the trunk.

best
Rupert
Regards

Regards,
Cristian


2013/6/18 Rupert Westenthaler <rupert.westentha...@gmail.com>

On Mon, Jun 17, 2013 at 10:18 PM, Cristian Petroaca
<cristian.petro...@gmail.com> wrote:
Hi Rupert,

Agreed on the SettingAnnotation/ParticipantAnnotation/OccurrentAnnotation
data structure.

Should I open up a Jira for all of this in order to encapsulate this
information and establish the goals and these initial steps towards
these goals?
Yes please. A JIRA issue for this work would be great.

How should I proceed further? Should I create some design documents that
need to be reviewed?
Usually it is best to write design related text directly in JIRA
by using Markdown [1] syntax. This will allow us later to use this
text directly for the documentation on the Stanbol Webpage.

best
Rupert


[1] http://daringfireball.net/projects/markdown/
Regards,
Cristian


2013/6/17 Rupert Westenthaler <rupert.westentha...@gmail.com>

On Thu, Jun 13, 2013 at 8:22 PM, Cristian Petroaca
<cristian.petro...@gmail.com> wrote:
Hi Rupert,

First of all thanks for the detailed suggestions.

2013/6/12 Rupert Westenthaler <rupert.westentha...@gmail.com>

Hi Cristian, all

really interesting use case!

In this mail I will try to give some suggestions on how this could
work out. These suggestions are mainly based on experiences and lessons
learned in the LIVE [2] project, where we built an information system
for the Olympic Games in Peking. While this project excluded the
extraction of Events from unstructured text (because the Olympic
Information System was already providing event data as XML messages),
the semantic search capabilities of this system were very similar to
the ones described by your use case.

IMHO you are not only trying to extract relations, but a formal
representation of the situation described by the text. So let's assume
that the goal is to annotate a Setting (or Situation) described in the
text - a fise:SettingAnnotation.

The DOLCE foundational ontology [1] gives some advice on how to model
those. The important relation for modeling this is Participation:

     PC(x, y, t) → (ED(x) ∧ PD(y) ∧ T(t))

where ..

  * ED are Endurants (continuants): Endurants do have an identity, so we
would typically refer to them as Entities referenced by a setting.
Note that this includes physical, non-physical as well as
social-objects.
  * PD are Perdurants (occurrents): Perdurants are entities that
happen in time. This refers to Events, Activities ...
  * PC is Participation: It is a time-indexed relation where
Endurants participate in Perdurants

Modeling this in RDF requires defining some intermediate resources
because RDF does not allow for n-ary relations.

  * fise:SettingAnnotation: It is really handy to define one resource
being the context for all described data. I would call this
"fise:SettingAnnotation" and define it as a sub-concept to
fise:Enhancement. All further enhancements about the extracted Setting
would define a "fise:in-setting" relation to it.

  * fise:ParticipantAnnotation: Is used to annotate that an Endurant is
participating in a setting (fise:in-setting fise:SettingAnnotation).
The Endurant itself is described by existing fise:TextAnnotation (the
mentions) and fise:EntityAnnotation (suggested Entities). Basically
the fise:ParticipantAnnotation will allow an EnhancementEngine to
state that several mentions (possibly in different sentences) do
represent the same Endurant as participating in the Setting. In
addition it would be possible to use the dc:type property (similar as
for fise:TextAnnotation) to refer to the role(s) of a participant
(e.g. the set: Agent (intentionally performs an action), Cause
(unintentionally, e.g. a mud slide), Patient (a passive role in an
activity) and Instrument (aids a process)), but I am wondering if one
could extract that information.

  * fise:OccurrentAnnotation: Is used to annotate a Perdurant in the
context of the Setting. Also fise:OccurrentAnnotation can link to
fise:TextAnnotation (typically verbs in the text defining the
perdurant) as well as fise:EntityAnnotation suggesting well known
Events in a knowledge base (e.g. an Election in a country, or an
uprising ...). In addition fise:OccurrentAnnotation can define
dc:has-participant links to fise:ParticipantAnnotation. In this case
it is explicitly stated that an Endurant (the
fise:ParticipantAnnotation) is involved in this Perdurant (the
fise:OccurrentAnnotation). As Occurrences are temporally indexed, this
annotation should also support properties for defining the
xsd:dateTime for the start/end.


Indeed, an event based data structure makes a lot of sense, with the
remark that you probably won't always be able to extract the date for a
given setting (situation).
There are 2 things which are unclear though.

1. Perdurant: You could have situations in which the object upon which
the Subject (or Endurant) is acting is not a transitory object (such as
an event or activity) but rather another Endurant. For example we can
have the phrase "USA invades Irak" where "USA" is the Endurant (Subject)
which performs the action of "invading" on another Endurant, namely
"Irak".
By using CAOS, USA would be the Agent and Iraq the Patient. Both are
Endurants. The activity "invading" would be the Perdurant. So ideally
you would have a  "fise:SettingAnnotation" with:

   * fise:ParticipantAnnotation for USA with the dc:type caos:Agent,
linking to a fise:TextAnnotation for "USA" and a fise:EntityAnnotation
linking to dbpedia:United_States
   * fise:ParticipantAnnotation for Iraq with the dc:type caos:Patient,
linking to a fise:TextAnnotation for "Irak" and a
fise:EntityAnnotation linking to  dbpedia:Iraq
   * fise:OccurrentAnnotation for "invades" with the dc:type
caos:Activity, linking to a fise:TextAnnotation for "invades"
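
The three annotations listed above can be sketched as plain
subject/predicate/object triples. This is only an illustration in Python:
the property names (fise:inSetting, fise:hasMention, fise:suggestion)
follow the structure proposed in this thread, and every urn:example:* node
name is a made-up placeholder, not the output of an existing engine.

```python
# Assumption: property names follow the structure proposed in this thread;
# the urn:example:* node names are placeholders for illustration only.
SETTING = "urn:example:setting-1"


def participant(node, role, mention, suggestion):
    """Return the triples for one fise:ParticipantAnnotation."""
    return [
        (node, "rdf:type", "fise:ParticipantAnnotation"),
        (node, "fise:inSetting", SETTING),
        (node, "dc:type", role),
        (node, "fise:hasMention", mention),
        (node, "fise:suggestion", suggestion),
    ]


def occurrent(node, activity_type, mention):
    """Return the triples for one fise:OccurrentAnnotation."""
    return [
        (node, "rdf:type", "fise:OccurrentAnnotation"),
        (node, "fise:inSetting", SETTING),
        (node, "dc:type", activity_type),
        (node, "fise:hasMention", mention),
    ]


graph = [(SETTING, "rdf:type", "fise:SettingAnnotation")]
graph += participant("urn:example:participant-usa", "caos:Agent",
                     "urn:example:text-usa", "urn:example:entity-united-states")
graph += participant("urn:example:participant-iraq", "caos:Patient",
                     "urn:example:text-irak", "urn:example:entity-iraq")
graph += occurrent("urn:example:occurrent-invades", "caos:Activity",
                   "urn:example:text-invades")
# The fise:EntityAnnotation nodes would in turn reference the suggested
# entities, here dbpedia:United_States and dbpedia:Iraq.

# Everything that belongs to the setting can be found via fise:inSetting.
in_setting = [s for s, p, o in graph if p == "fise:inSetting" and o == SETTING]
```

Because both participants and the activity point at the same setting node,
a consumer can collect the whole situation with a single lookup on
fise:inSetting.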

2. Where does the verb, which links the Subject and the Object, come
into this? I imagined that the Endurant would have a dc:"property" where
the property = verb which links to the Object in noun form. For example
take again the sentence "USA invades Irak". You would have the "USA"
Entity with dc:invader which points to the Object "Irak". The Endurant
would have as many dc:"property" elements as there are verbs which link
it to an Object.

As explained above you would have a fise:OccurrentAnnotation that
represents the Perdurant. The information that the activity mentioned in
the text is "invades" would be expressed by linking to a
fise:TextAnnotation. If you can also provide an Ontology for Tasks that
defines "myTasks:invade", the fise:OccurrentAnnotation could also link to
a fise:EntityAnnotation for this concept.

best
Rupert

### Consuming the data:
I think this model should be sufficient for use-cases as described by
you.
Users would be able to consume data on the setting level. This can be
done by simply retrieving all fise:ParticipantAnnotation as well as
fise:OccurrentAnnotation linked with a setting. BTW this was the
approach used in LIVE [2] for semantic search. It allows queries for
Settings that involve specific Entities, e.g. you could filter for
Settings that involve a {Person}, activities:Arrested and a specific
{Uprising}. However note that with this approach you will also get
results for Settings where the {Person} participated and another person
was arrested.
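
A minimal sketch of such a setting-level filter, including the false
positive mentioned above. The data and the ex:* identifiers are invented
for illustration; a real system would read the linked annotations from an
index or a triple store.

```python
# Hypothetical data: for each setting, the set of everything linked to it
# through fise:ParticipantAnnotation and fise:OccurrentAnnotation.
settings = {
    "setting-1": {"ex:PersonA", "ex:Uprising", "activities:Arrested"},
    # PersonA took part here, but it was PersonB who got arrested.
    "setting-2": {"ex:PersonA", "ex:PersonB", "ex:Uprising",
                  "activities:Arrested"},
    "setting-3": {"ex:PersonB", "activities:Arrested"},
}


def find_settings(settings, required):
    """Return the ids of all settings that involve every required resource."""
    return sorted(sid for sid, linked in settings.items() if required <= linked)


hits = find_settings(
    settings, {"ex:PersonA", "ex:Uprising", "activities:Arrested"})
```

Both setting-1 and setting-2 match, although PersonA was only arrested in
setting-1: on this level the filter cannot tell who took part in which
activity.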

Another possibility would be to process enhancement results on the
fise:OccurrentAnnotation. This would allow a much higher level of
granularity (e.g. it would allow to correctly answer the query
used as an example above). But I am wondering if the quality of the
Setting extraction will be sufficient for this. I also have doubts
whether this can still be realized by using semantic indexing to Apache
Solr, or if it would be better/necessary to store results in a
TripleStore and use SPARQL for retrieval.
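
For comparison, a sketch of a query on the level of individual
fise:OccurrentAnnotation records. Again the data, the ex:* identifiers and
the activities:Witnessed type are made up; the roles correspond to the
dc:type of the linked fise:ParticipantAnnotation.

```python
# Hypothetical data: one record per fise:OccurrentAnnotation, with the
# participants linked to it and their roles.
occurrents = [
    {"setting": "setting-1", "activity": "activities:Arrested",
     "participants": {"ex:PersonA": "fise:Patient"}},
    {"setting": "setting-2", "activity": "activities:Arrested",
     "participants": {"ex:PersonB": "fise:Patient"}},
    {"setting": "setting-2", "activity": "activities:Witnessed",
     "participants": {"ex:PersonA": "fise:Agent"}},
]


def find_occurrents(occurrents, activity, entity, role):
    """Return occurrents of the given activity where entity has the given role."""
    return [o for o in occurrents
            if o["activity"] == activity
            and o["participants"].get(entity) == role]


matches = find_occurrents(
    occurrents, "activities:Arrested", "ex:PersonA", "fise:Patient")
matched_settings = [o["setting"] for o in matches]
```

Here only setting-1 is returned, because the participant is checked against
the specific activity rather than against the setting as a whole.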

The methodology and query language used by YAGO [3] is also very
relevant for this (especially note chapter 7 SPOTL(X) Representation).
Another related topic is the enrichment of Entities (especially
Events) in knowledge bases based on Settings extracted from Documents.
As per definition - in DOLCE - Perdurants are temporally indexed. That
means that at the time when added to a knowledge base they might still
be in process. So the creation, enriching and refinement of such
Entities in the knowledge base seems to be critical for a system
like the one described in your use-case.

On Tue, Jun 11, 2013 at 9:09 PM, Cristian Petroaca
<cristian.petro...@gmail.com> wrote:
First of all I have to mention that I am new in the field of semantic
technologies, I've started to read about them in the last 4-5 months.
Having said that, I have a high level overview of what is a good approach
to solve this problem. There are a number of papers on the internet which
describe what steps need to be taken, such as: named entity recognition,
co-reference resolution, POS tagging and others.
The Stanbol NLP processing module currently only supports sentence
detection, tokenization, POS tagging, Chunking, NER and lemma. Support
for co-reference resolution and dependency trees is currently missing.
Stanford NLP is already integrated with Stanbol [4]. At the moment it
only supports English, but I am already working on including the other
supported languages. Other NLP frameworks that are already integrated
with Stanbol are Freeling [5] and Talismane [6]. But note that for all
of those the integration excludes support for co-reference and
dependency trees.

Anyways I am confident that one can implement a first prototype by
only using Sentences and POS tags and - if available - Chunks (e.g.
Noun phrases).
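
To give an idea of what such a first prototype could do with POS tags
alone, here is a deliberately naive sketch. It assumes Penn Treebank style
tags (NN*, VB*) and simply treats the nearest noun before the first verb
as Agent and the nearest noun after it as Patient. This is an illustrative
heuristic, not how any existing Stanbol engine works, and it will fail on
passive voice, pronouns and most longer sentences.

```python
def extract_setting(tagged_sentence):
    """Naive Agent/Activity/Patient guess from (word, POS tag) pairs.

    Assumes Penn Treebank style tags: nouns start with "NN", verbs with "VB".
    Returns None if the sentence contains no verb.
    """
    verb_index = next(
        (i for i, (_, pos) in enumerate(tagged_sentence) if pos.startswith("VB")),
        None,
    )
    if verb_index is None:
        return None

    before = tagged_sentence[:verb_index]
    after = tagged_sentence[verb_index + 1:]
    # Nearest noun to the left of the verb, nearest noun to the right.
    agent = next((w for w, pos in reversed(before) if pos.startswith("NN")), None)
    patient = next((w for w, pos in after if pos.startswith("NN")), None)
    return {
        "agent": agent,
        "activity": tagged_sentence[verb_index][0],
        "patient": patient,
    }


sentence = [("USA", "NNP"), ("invades", "VBZ"), ("Irak", "NNP")]
result = extract_setting(sentence)
# result == {"agent": "USA", "activity": "invades", "patient": "Irak"}
```

Chunks (noun phrases), where available, could replace the single-token
nouns used here.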


I assume that in the Stanbol context, a feature like Relation extraction
would be implemented as an EnhancementEngine?
What kind of effort would be required for a co-reference resolution tool
integration into Stanbol?

Yes in the end it would be an EnhancementEngine. But before we can
build such an engine we would need to

* extend the Stanbol NLP processing API with Annotations for
co-reference
* add support for JSON Serialisation/Parsing for those annotations so
that the RESTful NLP Analysis Service can provide co-reference
information

At this moment I'll be focusing on 2 aspects:

1. Determine the best data structure to encapsulate the extracted
information. I'll take a closer look at Dolce.
Don't make it too complex. Defining a proper structure to represent
Events will only pay off if we can also successfully extract such
information from processed texts.

I would start with

  * fise:SettingAnnotation
     * {fise:Enhancement} metadata

  * fise:ParticipantAnnotation
     * {fise:Enhancement} metadata
     * fise:inSetting {settingAnnotation}
     * fise:hasMention {textAnnotation}
     * fise:suggestion {entityAnnotation} (multiple if there are more
suggestions)
     * dc:type one of fise:Agent, fise:Patient, fise:Instrument,
fise:Cause
  * fise:OccurrentAnnotation
     * {fise:Enhancement} metadata
     * fise:inSetting {settingAnnotation}
     * fise:hasMention {textAnnotation}
     * dc:type set to fise:Activity

If it turns out that we can extract more, we can add more structure to
those annotations. We might also think about using a separate namespace
for those extensions to the annotation structure.
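
As a rough in-memory picture of the structure listed above, the three
annotation types could look like this. The sketch is in Python for
brevity; the class and field names are illustrative only and mirror the
proposed properties, they are not an existing Stanbol API.

```python
from dataclasses import dataclass, field
from typing import List

# Roles proposed above for the dc:type of a participant.
PARTICIPANT_ROLES = {"fise:Agent", "fise:Patient", "fise:Instrument", "fise:Cause"}


@dataclass
class SettingAnnotation:
    uri: str


@dataclass
class ParticipantAnnotation:
    uri: str
    in_setting: SettingAnnotation                          # fise:inSetting
    role: str                                              # dc:type
    mentions: List[str] = field(default_factory=list)      # fise:hasMention
    suggestions: List[str] = field(default_factory=list)   # fise:suggestion

    def __post_init__(self):
        if self.role not in PARTICIPANT_ROLES:
            raise ValueError(f"unknown participant role: {self.role}")


@dataclass
class OccurrentAnnotation:
    uri: str
    in_setting: SettingAnnotation                          # fise:inSetting
    mentions: List[str] = field(default_factory=list)      # fise:hasMention
    dc_type: str = "fise:Activity"                         # dc:type


# Placeholder URIs, for illustration only.
setting = SettingAnnotation("urn:example:setting-1")
agent = ParticipantAnnotation("urn:example:participant-usa", setting,
                              "fise:Agent", mentions=["urn:example:text-usa"])
activity = OccurrentAnnotation("urn:example:occurrent-invades", setting,
                               mentions=["urn:example:text-invades"])
```

Keeping the role as a checked string makes it easy to add further roles
later if it turns out that more can be extracted.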

2. Determine how should all of this be integrated into Stanbol.
Just create an EventExtractionEngine and configure an enhancement chain
that does NLP processing and EntityLinking.

You should have a look at

* SentimentSummarizationEngine [1] as it does a lot of things with NLP
processing results (e.g. connecting adjectives (via verbs) to
nouns/pronouns). So as long as we cannot use explicit dependency trees,
your code will need to do similar things with Nouns, Pronouns and
Verbs.

* Disambiguation-MLT engine, as it creates a Java representation of
present fise:TextAnnotation and fise:EntityAnnotation [2]. Something
similar will also be required by the EventExtractionEngine for fast
access to such annotations while iterating over the Sentences of the
text.
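
The kind of fast access meant here could be sketched as a small index that
returns all mentions inside a sentence span. The sketch is in Python for
brevity and is not derived from DisambiguationData itself; the class name,
the (start, end, uri) layout and the example offsets are assumptions.

```python
from bisect import bisect_left


class MentionIndex:
    """Look up mentions by character offsets.

    Mentions are (start, end, uri) tuples with an exclusive end offset.
    """

    def __init__(self, mentions):
        self._mentions = sorted(mentions)
        self._starts = [start for start, _, _ in self._mentions]

    def within(self, start, end):
        """Return all mentions fully contained in the span [start, end)."""
        i = bisect_left(self._starts, start)
        found = []
        while i < len(self._mentions) and self._mentions[i][0] < end:
            if self._mentions[i][1] <= end:
                found.append(self._mentions[i])
            i += 1
        return found


# Placeholder offsets and URIs for two sentences of a text.
index = MentionIndex([
    (12, 16, "urn:example:text-irak"),
    (0, 3, "urn:example:text-usa"),
    (20, 26, "urn:example:text-other"),
])
first_sentence = index.within(0, 17)
second_sentence = index.within(18, 30)
```

While iterating over the sentences of the text, the engine would call
within() once per sentence instead of scanning all annotations each time.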


best
Rupert

[1] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/sentiment-summarization/src/main/java/org/apache/stanbol/enhancer/engines/sentiment/summarize/SentimentSummarizationEngine.java
[2] https://svn.apache.org/repos/asf/stanbol/trunk/enhancement-engines/disambiguation-mlt/src/main/java/org/apache/stanbol/enhancer/engine/disambiguation/mlt/DisambiguationData.java
Thanks

Hope this helps to bootstrap this discussion
best
Rupert

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen




