Hi Alessandra,

the CELI engines were not part of the 0.9.0-incubating release. To get these new engines, you have to check out the latest sources (trunk) of Apache Stanbol and compile them yourself.
http://incubator.apache.org/stanbol/docs/trunk/tutorial.html

Best,
- Fabian

2012/5/24 Alessandra Donnini <[email protected]>:
> Are the new CELI enhancement engines available in the latest release,
> apache-stanbol-0.9.0-incubating (2012/05/08), available at
> http://incubator.apache.org/stanbol/downloads/releases.html?
> Do I need to download files from
> https://issues.apache.org/jira/browse/STANBOL-583 and install them? If so, how
> should I do that?
> thanks
> Alessandra Donnini
>
>
> On 24 May 2012, at 08:18, Rupert Westenthaler wrote:
>
>> Hi all,
>>
>> In the last two weeks I considerably improved the validation of the
>> Enhancements created by the different Stanbol Enhancement Engines.
>> Here is the list of related issues:
>>
>> * STANBOL-613: Define how to retrieve the language of the parsed content
>> * STANBOL-617: Define how to encode fise:TopicEnhancements
>> * STANBOL-625: Add link to the entityhub:site if suggested Entity is
>> available via the Entityhub
>>
>> Note also STANBOL-612 - providing a utility class that easily allows
>> to validate created enhancements in unit tests of EnhancementEngines.
>> All existing engines do now use this utility to validate Enhancements.
>> This is also true for the contributed CELI engine (STANBOL-583), which
>> already conforms to those tests.
>>
>> The next thing I would like to make more clear (and easier to
>> use/understand) is how confidence is represented for Stanbol
>> Enhancements. Related to this I would like to discuss the following
>> two suggestions:
>>
>> ### Suggestion 1: Require confidence values to be in the range [0..1]
>>
>> This is a long-running discussion, but I would really like to add a
>> check that enforces confidence values to be in the range [0..1].
>>
>> I think this change is necessary, because it moves the responsibility
>> for interpreting confidence values from the Stanbol users to the
>> implementors of the Engines. I know that providing confidence values
>> is a hard thing to do, but while it may be hard for Engine developers
>> it is near impossible for Stanbol users to do so.
>>
>> Note that EnhancementEngines would still be free to create Enhancements
>> with no "fise:confidence" value.
>>
>> Surprisingly, a lot of the existing Engines do already conform to this
>> rule. The most prominent exception is the Named Entity Tagging Engine
>> (o.a.s.enhancer.engine.entitytagging). Because of this I already
>> implemented an algorithm that normalizes confidence values by a
>> combination of the Levenshtein distance (selected-text <-> entity
>> label) and the Solr result score for the Entity (see STANBOL-624 for
>> details).
>>
>> If we could agree to this rule I would use a similar approach also for
>> other Engines that do not yet normalize confidence values to the range
>> [0..1].
>>
>> ### Suggestion 2: Add fise:confidence-level property
>>
>> The "confidence-level" is intended to make it easier for clients to
>> decide how to process Enhancements. It would not use a numerical range
>> but four distinct values:
>>
>> * confident: Meaning that a match is very likely - indicating that
>> those annotations typically can be accepted automatically (e.g. if the
>> EntityLinking engine finds a single Entity that exactly matches the
>> text selected by a text annotation)
>> * ambiguous: Meaning that there are several possibilities but it is
>> still likely that one of them matches (e.g. Paris, Paris (Texas))
>> * suggestion: Meaning that the match is not completely certain, but
>> there are not several options (e.g. Germans -> Germany)
>> * uncertain: Meaning that Entities do match, but the probability of a
>> match is rather speculative (e.g. John -> Elton John)
>>
>> IMHO using this classification would fit a lot of engines much better
>> than the numeric "fise:confidence" property, as it does not raise the
>> expectation in users that confidence values are on a ratio scale
>> (e.g. an Enhancement with a confidence of "0.8" is not two times as
>> likely as one with "0.4").
>>
>> Engines would have the possibility to manually add that information
>> to enhancements. For enhancements that do not define it we could
>> implement a post-processing engine that adds it based on generic
>> rules.
>>
>> e.g.
>>
>> * ignore Enhancements with an existing "confidence-level" assignment
>> * TextAnnotations with a confidence value > 0.8 => confident
>> * TextAnnotations with a confidence value between 0.5 and 0.8 => suggestion
>> * TextAnnotations with a confidence value < 0.5 => uncertain
>> * TextAnnotations with a single linked EntityAnnotation with a
>> confidence > 0.8 => confident
>> * TextAnnotations with several linked EntityAnnotations with a
>> confidence > 0.8 => ambiguous *)
>> * TextAnnotations with several linked EntityAnnotations with a
>> confidence > 0.5 but no one > 0.8 => ambiguous *)
>> * TextAnnotations with a single linked EntityAnnotation with a
>> confidence between 0.5 and 0.8 => suggestion
>> * TextAnnotations with EntityAnnotations with confidence values < 0.5
>> => uncertain
>> * TopicAnnotation with a confidence value > 0.8 => confident
>> * TopicAnnotation with a confidence value between 0.5 and 0.8 => suggestion
>> * TopicAnnotation with a confidence value < 0.5 => uncertain
>>
>> *) NOTE that in those cases only EntityAnnotations with a confidence
>> value > 0.5 would be marked as "ambiguous". Additional
>> EntityAnnotations with confidence values < 0.5 would be marked as
>> "uncertain".
>>
>> The values '0.8' and '0.5' should be configurable.
>>
>> Note that "fise:confidence-level" could also be used by Engines that
>> cannot provide fise:confidence values (e.g. the langid engine could
>> mark detected languages as "uncertain" if the parsed text was too
>> short).
>>
>> WDYT
>> Rupert
>>
>>
>> --
>> | Rupert Westenthaler [email protected]
>> | Bodenlehenstraße 11 ++43-699-11108907
>> | A-5500 Bischofshofen
>
--
Fabian
http://twitter.com/fctwitt
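
For readers who want to see the generic rules from Rupert's Suggestion 2 in executable form, here is a minimal sketch in Python. It is not Stanbol code: the function names (check_confidence, level_for_value, levels_for_entity_annotations) are hypothetical, and Stanbol itself is written in Java. Two points are my own assumptions because the mail leaves them open: values exactly equal to a threshold fall into the lower level, and a TextAnnotation with one EntityAnnotation above 0.8 plus others between 0.5 and 0.8 is treated as ambiguous.

```python
from typing import List, Optional

CONFIDENT = "confident"
AMBIGUOUS = "ambiguous"
SUGGESTION = "suggestion"
UNCERTAIN = "uncertain"


def check_confidence(value: float) -> float:
    """Suggestion 1: reject confidence values outside the range [0..1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence {value} is outside [0..1]")
    return value


def level_for_value(value: float, existing: Optional[str] = None,
                    high: float = 0.8, low: float = 0.5) -> str:
    """Level for a TextAnnotation or TopicAnnotation judged on its own value."""
    if existing is not None:
        # Rule: ignore enhancements that already carry a confidence-level.
        return existing
    check_confidence(value)
    if value > high:
        return CONFIDENT
    if value > low:
        # Assumption: a value exactly equal to `high` lands here.
        return SUGGESTION
    # Assumption: a value exactly equal to `low` lands here.
    return UNCERTAIN


def levels_for_entity_annotations(confidences: List[float],
                                  high: float = 0.8,
                                  low: float = 0.5) -> List[str]:
    """One level per EntityAnnotation linked to the same TextAnnotation."""
    for value in confidences:
        check_confidence(value)
    # Candidates above `low` compete with each other; the rest are noise.
    plausible = [value for value in confidences if value > low]
    levels = []
    for value in confidences:
        if value <= low:
            levels.append(UNCERTAIN)
        elif len(plausible) > 1:
            # Several plausible candidates: only those above `low` are
            # marked ambiguous, as in the *) note of the mail.
            levels.append(AMBIGUOUS)
        elif value > high:
            levels.append(CONFIDENT)
        else:
            levels.append(SUGGESTION)
    return levels
```

For example, levels_for_entity_annotations([0.9, 0.85, 0.2]) marks the first two candidates as ambiguous and the third as uncertain, while a single candidate at 0.9 is marked confident. The thresholds are keyword arguments because the mail asks for '0.8' and '0.5' to be configurable.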
