Are the new CELI enhancement engines included in the latest release, apache-stanbol-0.9.0-incubating (2012/05/08), available at http://incubator.apache.org/stanbol/downloads/releases.html? Or do I need to download the files from https://issues.apache.org/jira/browse/STANBOL-583 and install them myself? If so, how should I do that?

Thanks,
Alessandra Donnini

On 24 May 2012, at 08:18, Rupert Westenthaler wrote:

> Hi all,
>
> In the last two weeks I considerably improved the validation of the
> Enhancements created by the different Stanbol Enhancement Engines.
> Here is the list of related issues:
>
> * STANBOL-613: Define how to retrieve the language of the parsed content
> * STANBOL-617: Define how to encode fise:TopicEnhancements
> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
>   available via the Entityhub
>
> Note also STANBOL-612, which provides a utility class that makes it
> easy to validate created enhancements in unit tests of
> EnhancementEngines. All existing engines now use this utility to
> validate Enhancements. This is also true for the contributed CELI
> engines (STANBOL-583), which already conform to those tests.
>
> The next thing I would like to make clearer (and easier to
> use/understand) is how confidence is represented for Stanbol
> Enhancements. Related to this I would like to discuss the following
> two suggestions:
>
> ### Suggestion 1: Require confidence values to be in the range [0..1]
>
> This is a long-running discussion, but I would really like to add a
> check that enforces confidence values to be in the range [0..1].
>
> I think this change is necessary, because it moves the responsibility
> for interpreting confidence values from the Stanbol users to the
> implementors of the Engines. I know that providing confidence values
> is a hard thing to do, but while it may be hard for Engine developers,
> it is nearly impossible for Stanbol users to do so.
>
> Note that an EnhancementEngine would still be free to create
> Enhancements with no "fise:confidence" value.
>
> Surprisingly, a lot of the existing Engines already conform to this
> rule. The most prominent exception is the Named Entity Tagging Engine
> (o.a.s.enhancer.engine.entitytagging).
> Because of this I have already implemented an algorithm that
> normalizes confidence values using a combination of the Levenshtein
> distance (selected-text <-> entity label) and the Solr result score
> for the Entity (see STANBOL-624 for details).
>
> If we can agree on this rule, I would use a similar approach for
> other Engines that do not yet normalize confidence values to the
> range [0..1].
>
> ### Suggestion 2: Add fise:confidence-level property
>
> The "confidence-level" is intended to make it easier for clients to
> decide how to process Enhancements. It would not use a numerical range
> but four distinct values:
>
> * confident: a match is very likely, indicating that those
>   annotations can typically be accepted automatically (e.g. if the
>   EntityLinking engine finds a single Entity that exactly matches the
>   text selected by a text annotation)
> * ambiguous: there are several possibilities, but it is still likely
>   that one of them matches (e.g. Paris, Paris (Texas))
> * suggestion: the match is not completely certain, but there are not
>   several options (e.g. Germans -> Germany)
> * uncertain: Entities do match, but the probability of a match is
>   rather speculative (e.g. John -> Elton John)
>
> IMHO this classification would fit a lot of engines much better
> than the numeric "fise:confidence" property, as it does not raise the
> expectation in users that confidence values are on a ratio scale
> (e.g. an Enhancement with a confidence of "0.8" is not twice as
> likely as one with "0.4").
>
> Engines would have the possibility to add this information manually
> to enhancements. For enhancements that do not define it, we could
> implement a post-processing engine that adds it based on generic
> rules.
>
> e.g.
>
> * ignore Enhancements with an existing "confidence-level" assignment
> * TextAnnotations with a confidence value > 0.8 => confident
> * TextAnnotations with a confidence value < 0.8 and > 0.5 => suggestion
> * TextAnnotations with a confidence value < 0.5 => uncertain
> * TextAnnotations with a single linked EntityAnnotation with a
>   confidence > 0.8 => confident
> * TextAnnotations with several linked EntityAnnotations with a
>   confidence > 0.8 => ambiguous *)
> * TextAnnotations with several linked EntityAnnotations with a
>   confidence > 0.5 but none > 0.8 => ambiguous *)
> * TextAnnotations with a single linked EntityAnnotation with a
>   confidence < 0.8 and > 0.5 => suggestion
> * TextAnnotations with EntityAnnotations with confidence values < 0.5
>   => uncertain
> * TopicAnnotations with a confidence value > 0.8 => confident
> * TopicAnnotations with a confidence value < 0.8 and > 0.5 => suggestion
> * TopicAnnotations with a confidence value < 0.5 => uncertain
>
> *) NOTE that in those cases only EntityAnnotations with a confidence
> value > 0.5 would be marked as "ambiguous". Additional
> EntityAnnotations with confidence values < 0.5 would be marked as
> "uncertain".
>
> The values '0.8' and '0.5' should be configurable.
>
> Note that "fise:confidence-level" could also be used by Engines that
> cannot provide fise:confidence values (e.g. the langid engine could
> mark detected languages as "uncertain" if the parsed text was too
> short).
>
> WDYT
> Rupert
>
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
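
To make the generic rules proposed in the quoted message easier to discuss, here is a minimal Python sketch of how such a post-processing step might assign levels. It is not Stanbol code: the function names (`assign_level`, `levels_for_entities`) are hypothetical, and it rests on a few assumptions the proposal leaves open. Values exactly equal to a threshold fall into the lower level, and a TextAnnotation with one candidate above 0.8 plus further candidates above 0.5 is treated as "ambiguous".

```python
from typing import List, Optional, Tuple

CONFIDENT, AMBIGUOUS, SUGGESTION, UNCERTAIN = (
    "confident", "ambiguous", "suggestion", "uncertain")


def assign_level(confidence: float, existing: Optional[str] = None,
                 high: float = 0.8, low: float = 0.5) -> str:
    """Level for an annotation judged only by its own confidence value.

    An existing confidence-level assignment is kept untouched.
    Assumption: values exactly at a threshold get the lower level.
    """
    if existing is not None:
        return existing
    if confidence > high:
        return CONFIDENT
    if confidence > low:
        return SUGGESTION
    return UNCERTAIN


def levels_for_entities(confidences: List[float], high: float = 0.8,
                        low: float = 0.5) -> Tuple[str, List[str]]:
    """Level for a TextAnnotation with linked EntityAnnotations.

    Returns the level of the TextAnnotation and one level per
    EntityAnnotation. Candidates are the entities with confidence > low;
    entities at or below low are always marked "uncertain".
    """
    candidates = [c for c in confidences if c > low]
    if len(candidates) > 1:
        # Several plausible entities: ambiguous (assumed also for the
        # mixed case of one > high plus others > low).
        overall = AMBIGUOUS
    elif len(candidates) == 1:
        # A single plausible entity: confident or suggestion.
        overall = assign_level(candidates[0], high=high, low=low)
    else:
        overall = UNCERTAIN
    per_entity = [overall if c > low else UNCERTAIN for c in confidences]
    return overall, per_entity
```

For example, `levels_for_entities([0.9, 0.85, 0.3])` marks the TextAnnotation and the first two EntityAnnotations as "ambiguous" and the third as "uncertain", which matches the *) note in the proposal. The `high` and `low` parameters stand in for the configurable '0.8' and '0.5' values.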
