Re: [Suggestion] Enhancement confidence range [0..1] and addition of confidence-levels

Alessandra Donnini Fri, 25 May 2012 08:51:45 -0700

Thanks, it works.
Are there any details about the CELI engines' contribution to enhancement? 
I'm working with the Italian language for our demo, also using our SKOS 
thesaurus, and I would like to improve the enhancement quality. As a service user, 
I find suggestion 2 a very interesting way to interpret the numbers coming from an 
engine that, for us, is a black box.
Regards,
Alessandra



On 25 May 2012, at 08:40, Rupert Westenthaler wrote:

> 
> On 24.05.2012, at 17:15, Alessandra Donnini wrote:
> 
>> Hi Rupert 
>> I tried the two options but I have some doubts:
>> 1) Short option
>> "get the CELI engine bundle and install it to a recent Stanbol 0.10.0 
>> launcher": does "install" mean copying it in place of the {trunk}/enhancer directory?
> 
> Sorry for not being more precise ...
>  ... with "install" I was referring to installing the bundle of the CELI engines 
> into the running OSGi environment. 
> 
> Assuming that you have already checked out and built the branch, you have two 
> options to do that:
> 
> 1. via the console
> 
> go to directory  "{branch}/engines/celi" and call
> 
>    mvn clean install -PinstallBundle -Dsling.url=http://localhost:8080/system/console
> 
> 2. via the Apache Felix Web Console
> 
> go in the browser to "http://localhost:8080/system/console/bundles",
> press the "Install/Update..." button (top right corner), and
> add the CELI bundle from "{branch}/engines/celi/target"
> 
> 
> Both options assume that Stanbol 0.10.0 runs at localhost port 8080.
> 
>> 2) Complete workflow
>> What do you mean by "check out the branch [1]" in the complete workflow list? 
>> Do I need to replace the {trunk}/enhancer directory with the 
>> {branch}/celi-enhancement-engines directory? 
>> 
> 
> No, you should check out the branch to another directory.
> 
>   mvn install
> 
> copies the compiled modules to your local Maven repository
> 
>  ~/.m2/repository
> 
> so if you compile the branch, it will overwrite the trunk's modules in 
> your local repository (with the version from the branch). Because of this, if 
> you afterwards compile only the Full Launcher module in the trunk, it will 
> pick up the jar versions from the branch.
> 
> best
> Rupert
> 
> 
>> thanks
>> Alessandra
>> On 24 May 2012, at 11:58, Rupert Westenthaler wrote:
>> 
>>> Hi
>>> 
>>> On 24.05.2012, at 08:50, Alessandra Donnini <[email protected]> wrote:
>>> 
>>>> Are the new CELI enhancement engines included in the latest release, 
>>>> apache-stanbol-0.9.0-incubating (2012/05/08), available at 
>>>> http://incubator.apache.org/stanbol/downloads/releases.html?
>>>> Do I need to download the files from 
>>>> https://issues.apache.org/jira/browse/STANBOL-583 and install them? If so, 
>>>> how should I do that?
>>>> thanks
>>>> Alessandra Donnini
>>>> 
>>> 
>>> Currently the plan is to include the CELI engines only in 0.10.0.
>>> The engines are not yet included in the trunk, but are available in a branch 
>>> [1], as there are still two remaining issues with the NER engine. Once those 
>>> are solved, the engines should be included in the trunk within days.
>>> 
>>> To use the CELI engines in the current state you will need to
>>> 
>>> Short option (should work)
>>> 
>>> * check out the branch 
>>> http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>>> * call mvn install in the branch
>>> * get the CELI engine bundle and install it to a recent Stanbol 0.10.0 
>>> launcher
>>> 
>>> The complete workflow would be to
>>> 
>>> * check out and "mvn install" the trunk
>>> * check out the branch [1]
>>> * call mvn install in the branch
>>> * go back to {trunk}/launchers/full
>>> * call mvn clean install - this will create a full launcher that includes 
>>> the bundles as built in the branch
>>> * use this launcher to start Stanbol
>>> 
>>> best
>>> Rupert
>>> 
>>> [1] 
>>> http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>>> 
>>>> 
>>>> 
>>>> 
>>>> On 24 May 2012, at 08:18, Rupert Westenthaler wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> In the last two weeks I considerably improved the validation of the
>>>>> Enhancements created by the different Stanbol Enhancement Engines.
>>>>> Here is the list of related issues:
>>>>> 
>>>>> * STANBOL-613: Define how to retrieve the language of the parsed content
>>>>> * STANBOL-617: Define how to encode fise:TopicEnhancements
>>>>> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
>>>>> available via the Entityhub
>>>>> 
>>>>> Note also STANBOL-612, which provides a utility class that makes it easy
>>>>> to validate created enhancements in unit tests of EnhancementEngines.
>>>>> All existing engines now use this utility to validate Enhancements.
>>>>> This is also true for the contributed CELI engine (STANBOL-583), which
>>>>> already conforms to those tests.
>>>>> 
>>>>> The next thing I would like to make clearer (and easier to
>>>>> use/understand) is how confidence is represented for Stanbol
>>>>> Enhancements. Related to this, I would like to discuss the following
>>>>> two suggestions:
>>>>> 
>>>>> ### Suggestion 1: Require confidence values to be in the range [0..1]
>>>>> 
>>>>> This is a long-running discussion, but I would really like to add a
>>>>> check that enforces confidence values to be in the range
>>>>> [0..1].
>>>>> 
>>>>> I think this change is necessary, because it moves the responsibility
>>>>> for interpreting confidence values from the Stanbol users to the
>>>>> implementors of the Engines. I know that providing confidence values
>>>>> is a hard thing to do, but while it may be hard for Engine developers,
>>>>> it is nearly impossible for Stanbol users to do so.
>>>>> 
>>>>> Note that an EnhancementEngine would still be free to create Enhancements
>>>>> with no "fise:confidence" value.
>>>>> 
>>>>> Surprisingly, a lot of the existing Engines already conform to this
>>>>> rule. The most prominent exception is the Named Entity Tagging Engine
>>>>> (o.a.s.enhancer.engine.entitytagging). Because of this I have already
>>>>> implemented an algorithm that normalizes confidence values by a
>>>>> combination of the Levenshtein distance (selected-text <-> entity
>>>>> label) and the Solr result score for the Entity (see STANBOL-624 for
>>>>> details).
>>>>> 
>>>>> If we could agree on this rule, I would use a similar approach for the
>>>>> other Engines that do not yet normalize confidence values to the range
>>>>> [0..1]
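
To illustrate the check proposed in Suggestion 1, here is a minimal Python sketch. The function name `check_confidence` is hypothetical and not part of Stanbol. It only shows the rule as described above: a missing confidence is allowed, while any value outside [0..1] is rejected. It does not attempt the normalization described in STANBOL-624.

```python
from typing import Optional


def check_confidence(value: Optional[float]) -> Optional[float]:
    """Return the value unchanged if it is absent or lies in [0, 1].

    Hypothetical helper: an enhancement without a confidence value
    (None) is accepted; anything outside [0, 1] raises ValueError.
    """
    if value is None:
        return None
    # The chained comparison is also False for NaN, so NaN is rejected.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence {value!r} is outside [0..1]")
    return value
```

An engine whose raw scores fall outside the range would have to normalize them before such a check is applied.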
>>>>> 
>>>>> ### Suggestion 2: Add fise:confidence-level property
>>>>> 
>>>>> The "confidence-level" is intended to make it easier for clients to
>>>>> decide how to process Enhancements. It would not use a numerical range
>>>>> but four distinct values:
>>>>> 
>>>>> * confident: Meaning that a match is very likely - indicating that
>>>>> those annotations can typically be accepted automatically (e.g. if the
>>>>> EntityLinking engine finds a single Entity that exactly matches the
>>>>> text selected by a text annotation)
>>>>> * ambiguous: Meaning that there are several possibilities but it is
>>>>> still likely that one of them matches (e.g. Paris, Paris (Texas))
>>>>> * suggestion: Meaning that the match is not completely certain, but
>>>>> there are not several options (e.g. Germans -> Germany)
>>>>> * uncertain: Meaning that Entities do match, but the probability of a
>>>>> match is rather speculative (e.g. John -> Elton John)
>>>>> 
>>>>> IMHO using this classification would fit a lot of engines much better
>>>>> than the numeric "fise:confidence" property, as it does not raise the
>>>>> expectation in users that confidence values are on a ratio scale
>>>>> (e.g. an Enhancement with a confidence of "0.8" is not twice as
>>>>> likely as one with "0.4").
>>>>> 
>>>>> Engines would be able to add this information to enhancements
>>>>> themselves. For enhancements that do not define it, we could
>>>>> implement a post-processing engine that adds it based on generic
>>>>> rules.
>>>>> 
>>>>> e.g.
>>>>> 
>>>>> * ignore Enhancements with an existing "confidence-level" assignment
>>>>> * TextAnnotations with a confidence value > 0.8 => confident
>>>>> * TextAnnotations with a confidence value between 0.5 and 0.8 => suggestion
>>>>> * TextAnnotations with a confidence value < 0.5 => uncertain
>>>>> * TextAnnotations with a single linked EntityAnnotation with a
>>>>> confidence > 0.8 => confident
>>>>> * TextAnnotations with several linked EntityAnnotations with a
>>>>> confidence > 0.8 => ambiguous *)
>>>>> * TextAnnotations with several linked EntityAnnotations with a
>>>>> confidence > 0.5 but none > 0.8 => ambiguous *)
>>>>> * TextAnnotations with a single linked EntityAnnotation with a
>>>>> confidence between 0.5 and 0.8 => suggestion
>>>>> * TextAnnotations with EntityAnnotations with confidence values < 0.5
>>>>> => uncertain
>>>>> * TopicAnnotations with a confidence value > 0.8 => confident
>>>>> * TopicAnnotations with a confidence value between 0.5 and 0.8 => suggestion
>>>>> * TopicAnnotations with a confidence value < 0.5 => uncertain
>>>>> 
>>>>> *) NOTE that in those cases only EntityAnnotations with a confidence
>>>>> value > 0.5 would be marked as "ambiguous". Additional
>>>>> EntityAnnotations with confidence values < 0.5 would be marked as
>>>>> "uncertain"
>>>>> 
>>>>> The values '0.8' and '0.5' should be configurable.
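
The generic rules listed above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the Stanbol implementation: the function names and the `high`/`low` parameters are hypothetical, values exactly equal to a threshold fall into the lower band (the mail does not say), and a mix of one candidate above 0.8 and another between 0.5 and 0.8 is treated as "ambiguous" (also not specified in the mail).

```python
from typing import List, Optional

CONFIDENT = "confident"
AMBIGUOUS = "ambiguous"
SUGGESTION = "suggestion"
UNCERTAIN = "uncertain"


def level_for_value(confidence: float, high: float = 0.8, low: float = 0.5) -> str:
    """Map one numeric confidence to a level (Text/TopicAnnotation rules)."""
    if confidence > high:
        return CONFIDENT
    if confidence > low:
        return SUGGESTION
    # Assumption: values exactly at `low` (or below) count as uncertain.
    return UNCERTAIN


def levels_for_linked_entities(
    confidences: List[float], high: float = 0.8, low: float = 0.5
) -> List[str]:
    """Levels for the EntityAnnotations linked to a single TextAnnotation."""
    plausible = [c for c in confidences if c > low]
    if len(plausible) > 1:
        # Several plausible candidates: those above `low` are ambiguous,
        # the remaining ones are uncertain (see the *) note above).
        return [AMBIGUOUS if c > low else UNCERTAIN for c in confidences]
    # At most one plausible candidate: fall back to the per-value rule.
    return [level_for_value(c, high, low) for c in confidences]


def assign_level(existing: Optional[str], confidence: Optional[float]) -> Optional[str]:
    """Keep an existing level; otherwise derive one if a confidence is present."""
    if existing is not None:
        return existing
    if confidence is None:
        return None
    return level_for_value(confidence)
```

For example, `levels_for_linked_entities([0.9, 0.85, 0.3])` marks the first two candidates as ambiguous and the third as uncertain, while a single candidate at 0.7 is a suggestion.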
>>>>> 
>>>>> Note that "fise:confidence-level" could also be used by Engines that
>>>>> cannot provide fise:confidence values (e.g. the langid engine could
>>>>> mark detected languages as "uncertain" if the parsed text was too
>>>>> short).
>>>>> 
>>>>> WDYT
>>>>> Rupert
>>>>> 
>>>>> 
>>>>> -- 
>>>>> | Rupert Westenthaler             [email protected]
>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>> | A-5500 Bischofshofen
>>>> 
>> 
>
