[ 
https://issues.apache.org/jira/browse/STANBOL-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler resolved STANBOL-1262.
------------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.12.0

Implemented with http://svn.apache.org/r1560281 in 0.12 and merged to trunk 
with http://svn.apache.org/r1560286.

> Change/Improve processing of Chunks by EntityLinking 
> -----------------------------------------------------
>
>                 Key: STANBOL-1262
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1262
>             Project: Stanbol
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>             Fix For: 0.12.0
>
>
> The first step of EntityLinking (this applies to all EntityLinkingEngines, 
> incl. the Lucene FST Linking Engine) is to classify Tokens as "linkable", 
> "matchable" or "other". In addition, it determines the "processible" chunks 
> that contain those Tokens.
> This issue is about changing how "processible" chunks are determined 
> if the AnalyzedText contains multiple overlapping chunks.
> A typical case where this can happen is if both Noun Phrase Detection and 
> Named Entity Recognition are contained in the Chain. The chunks selected by 
> Named Entities will typically be smaller than the corresponding Noun Phrase. 
> There are even situations where the Named Entity does not include all 
> Nouns contained in a Noun Phrase.
> Here is an example taken from [1]:
>     After a disappointing start against an Everton side who led through Kevin 
> Mirallas's first-half goal ...
> While "Everton" is detected as an Organization by NER, the Noun Phrase "an 
> Everton side" also includes 'side' as a second noun. Therefore 'Everton' is 
> not considered for linking, as it only matches 1 of the 2 matchable tokens 
> within the 'processible phrase'.
> This is because EntityLinking currently merges overlapping processible 
> phrases together, a semantic that is no longer optimal for EntityLinking.
> To avoid recall problems like the one described, the last Chunk emitted by 
> the AnalyzedText should be used instead. For the above example this would 
> result in:
>  - an [other]: an Everton side
>  - Everton [linkable]: Everton
>  - side [matchable]: an Everton side
> So 'Everton' would get correctly linked to an Entity with the label Everton, 
> but 'side' would not get linked to an Entity with the label Side, as it is in 
> a Phrase with another linkable/matchable token.
> Another example would be ' ... the University of Munich is ... ' where one 
> could expect Noun Phrases for 'the University' and 'Munich' (if single token 
> noun phrases are emitted by the chunker component). In addition, as a result 
> of the NER engine, one can expect a chunk for 'University of Munich'. 
> This would result in:
>  - the [other]: the University
>  - University [matchable]: University of Munich
>  - of [other]: University of Munich
>  - Munich [linkable]: Munich
> As a linking rule, this means that 'University' is only linked to 
> Entities that also match 'Munich' in their label, while 'Munich' is also 
> linked to Entities whose label just includes 'Munich'. This is a small 
> difference from the current implementation, where 'Munich' alone would not 
> get linked because all the chunks would get merged into one big chunk 
> covering 'the University of Munich'.
> [1] 
> http://www.theguardian.com/football/2014/jan/20/west-bromwich-albion-everton-premier-league-match-report
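
The difference between merging overlapping chunks and using the last emitted
chunk per token can be sketched in a few lines of Python. This is only an
illustration, not the Stanbol implementation: the Chunk class and both
function names are hypothetical, chunks are given as token index ranges, and
"last emitted" is assumed to mean iteration ordered by start position with
longer chunks first on ties. That assumed ordering reproduces both examples
from the issue description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk covering tokens[start:end] (end is exclusive)."""
    start: int
    end: int


def emission_order(chunks):
    # ASSUMPTION: chunks are emitted ordered by start, longer chunks first.
    return sorted(chunks, key=lambda c: (c.start, -c.end))


def merge_overlapping(chunks):
    """Previous behaviour as described: overlapping chunks become one."""
    merged = []
    for c in emission_order(chunks):
        if merged and c.start < merged[-1].end:
            last = merged[-1]
            merged[-1] = Chunk(last.start, max(last.end, c.end))
        else:
            merged.append(c)
    return merged


def last_chunk_per_token(n_tokens, chunks):
    """Proposed behaviour: each token keeps the last emitted chunk covering it."""
    assigned = [None] * n_tokens
    for c in emission_order(chunks):
        for i in range(c.start, c.end):
            assigned[i] = c  # a later chunk overrides an earlier one
    return assigned


def phrase(tokens, chunk):
    """Text covered by a chunk."""
    return " ".join(tokens[chunk.start:chunk.end])


if __name__ == "__main__":
    tokens = ["the", "University", "of", "Munich"]
    # Noun phrases 'the University' and 'Munich', NER chunk 'University of Munich'
    chunks = [Chunk(0, 2), Chunk(3, 4), Chunk(1, 4)]
    for token, chunk in zip(tokens, last_chunk_per_token(len(tokens), chunks)):
        print(f"{token}: {phrase(tokens, chunk)}")
    print("merged:", [phrase(tokens, c) for c in merge_overlapping(chunks)])
```

With merging, every token ends up in the single phrase 'the University of
Munich'; with the per-token rule, 'Munich' keeps its own single-token chunk
and can therefore be linked on its own.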



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)