[ 
https://issues.apache.org/jira/browse/STANBOL-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler updated STANBOL-1262:
-----------------------------------------

    Description: 
The first step of EntityLinking (this applies to all EntityLinkingEngines, 
incl. the Lucene FST Linking Engine) is to classify Tokens as "linkable", 
"matchable" or "other". In addition, it determines the "processible" chunks 
that Tokens are contained in.

This issue is about changing how "processible" chunks are determined if the 
AnalyzedText contains multiple overlapping chunks.

A typical case where this can happen is if both Noun Phrase Detection and 
Named Entity Recognition are contained in the Chain. The chunks selected by 
Named Entities will typically be smaller than the corresponding Noun Phrase. 
There are even situations where the Named Entity does not include all Nouns 
contained in a Noun Phrase.

Here is an example taken from [1]:

    After a disappointing start against an Everton side who led through Kevin 
    Mirallas's first-half goal ...

While "Everton" is detected as Organization by NER, the Noun Phrase "an Everton 
side" also includes 'side' as a 2nd noun. Therefore 'Everton' is not considered 
for linking, as it only matches 1 of 2 matchable tokens within a 'processible 
phrase'.

This is because EntityLinking currently merges overlapping processible phrases 
together, a semantic that is no longer optimal for EntityLinking.
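
The effect of merging can be illustrated with a minimal Python sketch. This is 
not Stanbol code: the function names, the representation of chunks as 
(start, end) token index ranges and the treatment of linkable tokens as also 
being matchable are assumptions made for illustration only.

```python
def merge_overlapping(chunks):
    """Union overlapping [start, end) token ranges into single ranges."""
    merged = []
    for start, end in sorted(chunks):
        if merged and start < merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def chunk_match_ratio(chunk, matchable, matched):
    """Fraction of the chunk's matchable tokens that a label match covers."""
    start, end = chunk
    in_chunk = [i for i in range(start, end) if i in matchable]
    if not in_chunk:
        return 0.0
    return sum(1 for i in in_chunk if i in matched) / len(in_chunk)


tokens = ["an", "Everton", "side"]
noun_phrase = (0, 3)   # "an Everton side"
named_entity = (1, 2)  # "Everton"
matchable = {1, 2}     # "Everton" (linkable) and "side" (matchable)
matched = {1}          # the label "Everton" only matches token 1

merged = merge_overlapping([noun_phrase, named_entity])
ratio_merged = chunk_match_ratio(merged[0], matchable, matched)
ratio_ner = chunk_match_ratio(named_entity, matchable, matched)
```

With the merged chunk the label 'Everton' covers only half of the matchable 
tokens, while within the Named Entity chunk alone it covers all of them.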

To avoid recall problems like the one described, the last Chunk emitted by the 
AnalyzedText should be used instead. For the above example this would result in:

 - an [other]: an Everton side
 - Everton [linkable]: Everton
 - side [matchable]: an Everton side

So 'Everton' would get correctly linked to an Entity with the label Everton, 
but 'side' would not get linked to an Entity with the label Side, as it is in 
a Phrase with another linkable/matchable token.

Another example would be ' ... the University of Munich is ... ', where one 
could expect Noun Phrases for 'the University' and 'Munich' (if single-token 
noun phrases are emitted by the chunker component). In addition, as a result 
of the NER engine, one can expect a chunk for 'University of Munich'.

 - the [other]: the University
 - University [matchable]: University of Munich
 - of [other]: University of Munich
 - Munich [linkable]: Munich

This would result in the linking rules that 'University' is only linked to 
Entities that also match Munich in their label, while Munich would also be 
linked to Entities that just include Munich. This is a small difference from 
the current implementation, where Munich alone would not get linked, as all 
the chunks would get merged into a big one covering 'the University of Munich'.
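
The per-token chunk assignment shown for both examples can be sketched in 
Python as follows. This is only a sketch under stated assumptions, not the 
actual implementation: chunks are (start, end) token index ranges, the 
function name chunk_per_token is hypothetical, and the emission order of the 
AnalyzedText is assumed to be by start offset with longer chunks first.

```python
def chunk_per_token(tokens, chunks):
    """Assign each token the last emitted chunk that covers it."""
    # Assumption: chunks are emitted ordered by start, longer chunks first.
    ordered = sorted(chunks, key=lambda c: (c[0], -c[1]))
    result = []
    for i, token in enumerate(tokens):
        # All chunks covering token i, in the assumed emission order.
        covering = [c for c in ordered if c[0] <= i < c[1]]
        last = covering[-1] if covering else None
        text = " ".join(tokens[last[0]:last[1]]) if last else None
        result.append((token, text))
    return result


# Noun Phrase "an Everton side" plus Named Entity "Everton"
everton = chunk_per_token(["an", "Everton", "side"], [(0, 3), (1, 2)])

# Noun Phrases "the University" and "Munich" plus
# Named Entity "University of Munich"
munich = chunk_per_token(
    ["the", "University", "of", "Munich"],
    [(0, 2), (1, 4), (3, 4)],
)
```

Under these assumptions both results reproduce the token to chunk lists given 
above, without merging the overlapping chunks into one.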

[1] 
http://www.theguardian.com/football/2014/jan/20/west-bromwich-albion-everton-premier-league-match-report


  was:
The first step of EntityLinking (this applies to all EntityLinkingEngines, 
incl. the Lucene FST Linking Engine) is to classify Tokens as "linkable", 
"matchable" or "other". In addition, it determines the "processible" chunks 
that Tokens are contained in.

This issue is about changing how "processible" chunks are determined if the 
AnalyzedText contains multiple overlapping chunks.

A typical case where this can happen is if both Noun Phrase Detection and 
Named Entity Recognition are contained in the Chain. The chunks selected by 
Named Entities will typically be smaller than the corresponding Noun Phrase. 
There are even situations where the Named Entity does not include all Nouns 
contained in a Noun Phrase.

Here is an example taken from [1]:

    After a disappointing start against an Everton side who led through Kevin 
    Mirallas's first-half goal ...

While "Everton" is detected as Organization by NER, the Noun Phrase "an Everton 
side" also includes 'side' as a 2nd noun. Therefore 'Everton' is not considered 
for linking, as it only matches 1 of 2 matchable tokens within a 'processible 
phrase'.

This is because EntityLinking currently merges overlapping processible phrases 
together, a semantic that is no longer optimal for EntityLinking.

To avoid recall problems like the one described, the intersection instead of 
the union of multiple processible chunks needs to be used.

For the given example this would result in:

 - an [other]: an Everton side
 - Everton [linkable]: Everton
 - side [matchable]: an Everton side

So 'Everton' would get correctly linked to an Entity with the label Everton, 
but 'side' would not get linked to an Entity with the label Side, as it is in 
a Phrase with another linkable/matchable token.


[1] 
http://www.theguardian.com/football/2014/jan/20/west-bromwich-albion-everton-premier-league-match-report



> Change/Improve processing of Chunks by EntityLinking 
> -----------------------------------------------------
>
>                 Key: STANBOL-1262
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1262
>             Project: Stanbol
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
