Re: [Suggestion] Enhancement confidence range [0..1] and addition of confidence-levels

Fabian Christ Thu, 24 May 2012 00:05:18 -0700

Hi Alessandra,

the CELI engines were not part of the 0.9.0-incubating release. To get
these new engines, you have to check out the latest sources (trunk) of
Apache Stanbol and compile them yourself.


http://incubator.apache.org/stanbol/docs/trunk/tutorial.html

Best,
 - Fabian

2012/5/24 Alessandra Donnini <[email protected]>:
> Are the new CELI enhancement engines included in the latest release,
> apache-stanbol-0.9.0-incubating (2012/05/08), available at
> http://incubator.apache.org/stanbol/downloads/releases.html?
> Do I need to download the files from
> https://issues.apache.org/jira/browse/STANBOL-583 and install them? If so, how
> should I do that?
> thanks
> Alessandra Donnini
>
>
>
>
> On 24 May 2012, at 08:18, Rupert Westenthaler wrote:
>
>> Hi all,
>>
>> In the last two weeks I considerably improved the validation of the
>> Enhancements created by the different Stanbol Enhancement Engines.
>> Here is the list of related issues:
>>
>> * STANBOL-613: Define how to retrieve the language of the parsed content
>> * STANBOL-617: Define how to encode fise:TopicEnhancements
>> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
>> available via the Entityhub
>>
>> Note also STANBOL-612, which provides a utility class that makes it easy
>> to validate created enhancements in unit tests of EnhancementEngines.
>> All existing engines now use this utility to validate Enhancements.
>> This is also true for the contributed CELI engine (STANBOL-583), which
>> already conforms to those tests.
>>
>> The next thing I would like to make clearer (and easier to
>> use/understand) is how confidence is represented for Stanbol
>> Enhancements. Related to this I would like to discuss the following
>> two suggestions:
>>
>> ### Suggestion 1: Require confidence values to be in the range [0..1]
>>
>> This is a long-running discussion, but I would really like to add a
>> check that enforces confidence values to be in the range
>> [0..1].
>>
>> I think this change is necessary, because it moves the responsibility
>> for interpreting confidence values from the Stanbol users to the
>> implementors of the Engines. I know that providing confidence values
>> is a hard thing to do, but while it may be hard for Engine developers,
>> it is next to impossible for Stanbol users to do so.
>>
>> Note that EnhancementEngines would still be free to create Enhancements
>> with no "fise:confidence" value.
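
A minimal sketch of the check proposed in Suggestion 1, written in Python for brevity. The function name `validate_confidence` is hypothetical and is not part of the Stanbol API; the sketch only assumes that a missing value stays legal while any present value must lie in [0..1].

```python
from typing import Optional


def validate_confidence(confidence: Optional[float]) -> Optional[float]:
    """Return the value unchanged if it is absent or within [0..1]."""
    if confidence is None:
        # Engines may still create Enhancements without fise:confidence.
        return None
    # The chained comparison is also False for NaN, so NaN is rejected.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence {confidence!r} is outside [0..1]")
    return confidence
```

Rejecting the value outright, rather than clamping it, keeps the responsibility with the engine implementor, which is the point of the suggestion.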
>>
>> Surprisingly, a lot of the existing Engines already conform to this
>> rule. The most prominent exception is the Named Entity Tagging Engine
>> (o.a.s.enhancer.engine.entitytagging). Because of this I have already
>> implemented an algorithm that normalizes confidence values by a
>> combination of the Levenshtein distance (selected-text <-> entity
>> label) and the Solr result score for the Entity (see STANBOL-624 for
>> details).
>>
>> If we could agree on this rule, I would use a similar approach for
>> other Engines that do not yet normalize confidence values to
>> [0..1].
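
To illustrate the general idea of combining a label distance with a search score, here is a rough Python sketch. It is not the algorithm from STANBOL-624, whose details are not given in this mail. The way the two factors are combined (a simple product), the use of the best score in the result list as `max_score`, and all function names are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def label_similarity(selected_text: str, label: str) -> float:
    """Map the edit distance to [0..1]; 1.0 means identical (ignoring case)."""
    longest = max(len(selected_text), len(label))
    if longest == 0:
        return 1.0
    distance = levenshtein(selected_text.lower(), label.lower())
    return 1.0 - distance / longest


def normalized_confidence(selected_text: str, label: str,
                          score: float, max_score: float) -> float:
    """Hypothetical combination: label similarity times the relative score."""
    if max_score <= 0:
        return 0.0
    # Clamp so that an unexpected score cannot push the result out of [0..1].
    relative_score = min(max(score / max_score, 0.0), 1.0)
    return label_similarity(selected_text, label) * relative_score
```

Because both factors are in [0..1], their product is as well, which is the only property the proposed rule requires.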
>>
>> ### Suggestion 2: Add fise:confidence-level property
>>
>> The "confidence-level" is intended to make it easier for clients to
>> decide how to process Enhancements. It would not use a numerical range
>> but four distinct values:
>>
>> * confident: Meaning that a match is very likely - indicating that
>> those annotations typically can be accepted automatically (e.g. if the
>> EntityLinking engine finds a single Entity that exactly matches the
>> text selected by a text annotation)
>> * ambiguous: Meaning that there are several possibilities but it is
>> still likely that one of them matches (e.g. Paris, Paris (Texas))
>> * suggestion: Meaning that the match is not completely certain, but
>> there are not several options (e.g. Germans -> Germany)
>> * uncertain: Meaning that Entities do match, but the probability of a
>> match is rather speculative (e.g. John -> Elton John)
>>
>> IMHO this classification would fit a lot of engines much better
>> than the numeric "fise:confidence" property, as it does not raise the
>> expectation in users that confidence values are on a ratio scale
>> (e.g. an Enhancement with a confidence of "0.8" is not two times as
>> likely as one with "0.4").
>>
>> Engines would have the possibility to manually add this information
>> to enhancements. For enhancements that do not define it we could
>> implement a post-processing engine that adds it based on generic
>> rules.
>>
>> e.g.
>>
>> * ignore Enhancements with an existing "confidence-level" assignment
>> * TextAnnotations with a confidence value > 0.8 => confident
>> * TextAnnotations with a confidence value between 0.5 and 0.8 => suggestion
>> * TextAnnotations with a confidence value < 0.5 => uncertain
>> * TextAnnotations with a single linked EntityAnnotation with a
>> confidence > 0.8 => confident
>> * TextAnnotations with several linked EntityAnnotations with a
>> confidence > 0.8 => ambiguous *)
>> * TextAnnotations with several linked EntityAnnotations with a
>> confidence > 0.5 but none > 0.8 => ambiguous *)
>> * TextAnnotations with a single linked EntityAnnotation with a
>> confidence between 0.5 and 0.8 => suggestion
>> * TextAnnotations with EntityAnnotations with confidence values < 0.5
>> => uncertain
>> * TopicAnnotations with a confidence value > 0.8 => confident
>> * TopicAnnotations with a confidence value between 0.5 and 0.8 => suggestion
>> * TopicAnnotations with a confidence value < 0.5 => uncertain
>>
>> *) NOTE that in those cases only EntityAnnotations with a confidence
>> value > 0.5 would be marked as "ambiguous". Additional
>> EntityAnnotations with confidence values < 0.5 would be marked as
>> "uncertain".
>>
>> The values '0.8' and '0.5' should be configurable.
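
The generic rules above could be sketched roughly as follows in Python. The function names and the list-of-floats input are hypothetical. The sketch also makes assumptions the rules leave open: values exactly equal to a threshold fall into the middle band, and "single" versus "several" counts only the linked EntityAnnotations at or above the lower threshold. Skipping enhancements that already carry a "confidence-level" is left to the caller.

```python
from typing import List


def level_for_value(confidence: float,
                    high: float = 0.8, low: float = 0.5) -> str:
    """Level for a TextAnnotation or TopicAnnotation judged on its own value."""
    if confidence > high:
        return "confident"
    if confidence < low:
        return "uncertain"
    return "suggestion"


def levels_for_linked_entities(confidences: List[float],
                               high: float = 0.8,
                               low: float = 0.5) -> List[str]:
    """Levels for all EntityAnnotations linked to one TextAnnotation."""
    # Only annotations at or above the lower threshold compete with each other.
    candidates = [c for c in confidences if c >= low]
    levels = []
    for c in confidences:
        if c < low:
            levels.append("uncertain")
        elif len(candidates) > 1:
            levels.append("ambiguous")
        else:
            levels.append(level_for_value(c, high, low))
    return levels
```

For example, `levels_for_linked_entities([0.9, 0.85, 0.3])` marks the first two entities as "ambiguous" and the third as "uncertain", matching the note marked with *). Both thresholds are parameters, in line with the request that they be configurable.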
>>
>> Note that "fise:confidence-level" could also be used by Engines that
>> cannot provide fise:confidence values (e.g. the langid engine could
>> mark detected languages as "uncertain" if the parsed text was too
>> short).
>>
>> WDYT
>> Rupert
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>



-- 
Fabian
http://twitter.com/fctwitt
