Rupert Westenthaler created STANBOL-624:
-------------------------------------------

             Summary: The NamedEntityTagging engine should use confidence 
values between [0..1]
                 Key: STANBOL-624
                 URL: https://issues.apache.org/jira/browse/STANBOL-624
             Project: Stanbol
          Issue Type: Bug
          Components: Enhancer
    Affects Versions: 0.9.0-incubating
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler
             Fix For: 0.10.0-incubating


Currently the Solr result scores are used as confidence values. The only 
adjustment is that exact matches are sorted in front of partial matches. 
However, Solr result scores are not within the range [0..1], which makes it 
hard for clients to process confidence values.

The suggestion is to use the following algorithm to "normalize" the confidence 
values of this engine:

* score ... the Solr result score of the current entity
* maxScore ... the highest Solr result score
* maxExactScore ... the highest Solr result score of an Entity that exactly 
matches the fise:selected-text
* levenshteinSimilarity ... 
1 - LevenshteinDistance(selectedText,label)/Math.max(selectedText.length(),label.length()), 
so that an exact match yields 1
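
The similarity measure can be sketched in Java as follows. This is only an illustration, not the engine's code: the class and method names are hypothetical, and it assumes the similarity is 1 minus the length-normalized edit distance, because the exact-match check in the algorithm requires an exact match to yield 1.

```java
public class LevenshteinSimilarity {

    /** Classic dynamic-programming edit distance (insert, delete, substitute cost 1 each). */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed definition: 1 - distance / max(length), so identical strings give 1.0. */
    static double similarity(String selectedText, String label) {
        int maxLength = Math.max(selectedText.length(), label.length());
        if (maxLength == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(selectedText, label) / maxLength;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Paris", "Paris"));        // exact match
        System.out.println(similarity("Paris", "Paris Hilton")); // partial match
    }
}
```

With this definition, the selected text "Paris" compared to the label "Paris Hilton" has an edit distance of 7 over a maximum length of 12, giving a similarity of 5/12 (about 0.42).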

The normalized Score is calculated as follows:

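    if (levenshteinSimilarity == 1) { // exact match: normalize by the best exact match
        score = score / maxExactScore;
    } else { // partial match: scale by similarity, normalize by the best overall score
        score = score * levenshteinSimilarity / maxScore;
    }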

This ensures that

* If there is an exact match, the best exact match will have the confidence 1.0
* If there are multiple exact matches, they will be rated based on the Solr 
result scores (normalized to 1 using the result score of the best exact match 
as base)
* All partial matches will have a score <= their levenshteinSimilarity
* Partial matches are normalized by using the max result score (regardless of 
whether the result with the max Solr result score is an exact match or not).
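
The properties listed above can be checked with a small Java sketch of the proposed normalization. It is an illustration under stated assumptions, not the engine's implementation: the Suggestion class, the normalize method and the sample scores are hypothetical, levenshteinSimilarity is assumed to be 1 for an exact match, and the guards against a zero maximum score are an addition that the issue does not mention.

```java
import java.util.ArrayList;
import java.util.List;

public class ConfidenceNormalizer {

    /** Hypothetical holder for one Solr result: raw score plus label similarity. */
    static final class Suggestion {
        final String label;
        final double score;                 // raw Solr result score
        final double levenshteinSimilarity; // assumed to be 1.0 for an exact match

        Suggestion(String label, double score, double levenshteinSimilarity) {
            this.label = label;
            this.score = score;
            this.levenshteinSimilarity = levenshteinSimilarity;
        }
    }

    /** Returns one confidence per suggestion, in input order. */
    static List<Double> normalize(List<Suggestion> suggestions) {
        double maxScore = 0.0;      // highest score over all results
        double maxExactScore = 0.0; // highest score among exact matches only
        for (Suggestion s : suggestions) {
            maxScore = Math.max(maxScore, s.score);
            if (s.levenshteinSimilarity == 1.0) {
                maxExactScore = Math.max(maxExactScore, s.score);
            }
        }
        List<Double> confidences = new ArrayList<>();
        for (Suggestion s : suggestions) {
            if (s.levenshteinSimilarity == 1.0) {
                // exact match: normalize by the best exact match
                confidences.add(maxExactScore > 0 ? s.score / maxExactScore : 0.0);
            } else {
                // partial match: scale by similarity, normalize by the best overall score
                confidences.add(maxScore > 0
                        ? s.score * s.levenshteinSimilarity / maxScore : 0.0);
            }
        }
        return confidences;
    }

    public static void main(String[] args) {
        List<Suggestion> suggestions = new ArrayList<>();
        suggestions.add(new Suggestion("Paris", 3.0, 1.0));
        suggestions.add(new Suggestion("Paris", 1.5, 1.0));
        suggestions.add(new Suggestion("Paris Hilton", 6.0, 5.0 / 12.0));
        System.out.println(normalize(suggestions));
    }
}
```

In this made-up example the best exact match gets 1.0 and the second exact match gets 0.5. The partial match stays at its similarity of 5/12 even though it has the highest raw Solr score.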

Note: This resembles a disambiguation based on the label of the Entity as well 
as possible Document Boosts in the Solr index. This is NOT intended to be a 
real Entity Disambiguation algorithm.



        
