Rupert Westenthaler created STANBOL-624:
-------------------------------------------
Summary: The NamedEntityTagging engine should use confidence
values between [0..1]
Key: STANBOL-624
URL: https://issues.apache.org/jira/browse/STANBOL-624
Project: Stanbol
Issue Type: Bug
Components: Enhancer
Affects Versions: 0.9.0-incubating
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
Fix For: 0.10.0-incubating
Currently the Solr result scores are used directly as confidence values. The
only adjustment is that exact matches are sorted in front of partial matches.
However, Solr result scores are not within the range [0..1], which makes it
hard for clients to process confidence values.
The suggestion is to use the following algorithm to "normalize" the confidence
values of this engine:
* score ... the Solr result score of the current entity
* maxScore ... the highest Solr result score
* maxExactScore ... the highest Solr result score of an Entity that exactly
matches the fise:selected-text
* levenshteinSimilarity ... 1 -
LevenshteinDistance(selectedText,label)/Math.max(selectedText.length(),label.length()),
so that an exact match has a similarity of 1
The normalized Score is calculated as follows:
    if(levenshteinSimilarity == 1) //exact match
        score = score/maxExactScore;
    else //partial match
        score = score*levenshteinSimilarity/maxScore;
This ensures that:
* If there is an exact match, the best one will have the confidence 1.0
* If there are multiple exact matches, they will be rated based on the Solr
result scores (normalized to 1 using the result score of the best exact match
as base)
* All partial matches will have a score <= the levenshteinSimilarity
* Partial matches are normalized by using the max result score (regardless of
whether the result with the max Solr result score is an exact match or not).
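The following is a minimal, self-contained Java sketch of the proposed
normalization, not the actual engine code. The class and method names
(ConfidenceNormalizer, levenshteinSimilarity, normalize) are hypothetical. It
assumes that the similarity is 1 minus the length-normalized Levenshtein
distance, so that an exact match yields 1 as the pseudocode above requires,
and that labels are compared case-sensitively.

```java
final class ConfidenceNormalizer {

    /** Classic dynamic-programming Levenshtein edit distance (two-row variant). */
    static int levenshteinDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed definition: 1 - distance / max(length), so 1.0 means exact match. */
    static double levenshteinSimilarity(String selectedText, String label) {
        int maxLen = Math.max(selectedText.length(), label.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshteinDistance(selectedText, label) / maxLen;
    }

    /**
     * Normalizes a Solr result score to [0..1].
     * maxExactScore is only read for exact matches, so it may be 0
     * when the result list contains no exact match.
     */
    static double normalize(double score, double maxScore,
                            double maxExactScore, double similarity) {
        if (similarity == 1.0) { // exact match
            return score / maxExactScore;
        }
        return score * similarity / maxScore; // partial match
    }
}
```

For example, with the selected text "Paris" and the label "Paris Hilton" the
edit distance is 7 and the longer string has 12 characters, giving a
similarity of 5/12. With score = 8 and maxScore = 10, the normalized confidence
is 8 * (5/12) / 10, about 0.33, which stays below the similarity as intended.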
Note: This resembles a disambiguation based on the label of the Entity as well
as possible Document Boosts in the Solr index. This is NOT intended to be a
real Entity Disambiguation algorithm.