Rupert Westenthaler created STANBOL-1262:
--------------------------------------------
Summary: Change/Improve processing of Chunks by EntityLinking
Key: STANBOL-1262
URL: https://issues.apache.org/jira/browse/STANBOL-1262
Project: Stanbol
Issue Type: Improvement
Affects Versions: 0.12.0
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
The first step of EntityLinking (this applies to all EntityLinkingEngines, incl.
the Lucene FST Linking Engine) classifies Tokens as "linkable", "matchable" or
"other". In addition, it determines the "processible" chunks that contain those
Tokens.
This issue is about changing how "processible" chunks are determined when the
AnalyzedText contains multiple overlapping chunks.
A typical case where this happens is when both Noun Phrase Detection and Named
Entity Recognition are part of the Chain. The chunks selected by Named Entities
will typically be smaller than the corresponding Noun Phrases. There are even
situations where the Named Entity does not include all Nouns contained in a
Noun Phrase.
Here is an example taken from [1]:
After a disappointing start against an Everton side who led through Kevin
Mirallas's first-half goal ...
While "Everton" is detected as an Organization by NER, the Noun Phrase "an
Everton side" also includes 'side' as a 2nd noun. Therefore 'Everton' is not
considered for linking, as it matches only 1 of the 2 matchable tokens within
the 'processible phrase'.
This is because EntityLinking currently merges overlapping processible phrases
together, a semantic that is no longer optimal for EntityLinking.
To avoid recall problems like the one described, the intersection instead of
the union of multiple processible chunks needs to be used.
For the given example this would result in (token [classification]:
processible chunk):
- an [other]: an Everton side
- Everton [linkable]: Everton
- side [matchable]: an Everton side
So 'Everton' would get correctly linked to an Entity with the label Everton, but
'side' would not get linked to an Entity with the label Side, as it is in a
phrase with another linkable/matchable token.
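
The difference between the two semantics can be illustrated with a minimal
sketch. This is not Stanbol code: the names Span, union_context,
intersection_context and match_ratio are hypothetical, chunks are modelled as
half-open token index ranges, and the union is simplified to the chunks that
directly cover a token (a real merge of overlapping chunks may be transitive).

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Span:
    """Half-open token range [start, end). Hypothetical helper type."""
    start: int
    end: int

    def covers(self, index: int) -> bool:
        return self.start <= index < self.end


def union_context(index: int, chunks: List[Span]) -> Optional[Span]:
    """Current behaviour as described above: merge all chunks covering the token."""
    covering = [c for c in chunks if c.covers(index)]
    if not covering:
        return None
    return Span(min(c.start for c in covering), max(c.end for c in covering))


def intersection_context(index: int, chunks: List[Span]) -> Optional[Span]:
    """Proposed behaviour: intersect all chunks covering the token."""
    covering = [c for c in chunks if c.covers(index)]
    if not covering:
        return None
    # Every covering chunk contains `index`, so the intersection is never empty.
    return Span(max(c.start for c in covering), min(c.end for c in covering))


def match_ratio(matched: Set[int], matchable: Set[int], context: Span) -> float:
    """Fraction of the matchable tokens inside `context` that were matched."""
    in_context = {i for i in matchable if context.covers(i)}
    if not in_context:
        return 0.0
    return len(matched & in_context) / len(in_context)


# Tokens of the example phrase: 0 = "an", 1 = "Everton", 2 = "side"
TOKENS = ["an", "Everton", "side"]
MATCHABLE = {1, 2}                   # "Everton" (linkable) and "side" (matchable)
CHUNKS = [Span(0, 3), Span(1, 2)]    # Noun Phrase and Named Entity

if __name__ == "__main__":
    matched = {1}  # an Entity labelled "Everton" matches token 1 only
    for name, fn in (("union", union_context), ("intersection", intersection_context)):
        ctx = fn(1, CHUNKS)
        print(name, TOKENS[ctx.start:ctx.end], match_ratio(matched, MATCHABLE, ctx))
```

Under these assumptions the union gives 'Everton' the context "an Everton side"
with 1 of 2 matchable tokens matched, while the intersection gives it the
context "Everton" with 1 of 1 matched. The tokens 'an' and 'side' keep the full
Noun Phrase as their context in both cases.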
[1]
http://www.theguardian.com/football/2014/jan/20/west-bromwich-albion-everton-premier-league-match-report
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)