Hi all,

Earlier today I committed the first version of the KeywordLinkingEngine. This is basically a re-implementation of the TaxonomyLinkingEngine I presented during the Paris IKS Community Workshop.

This mail has two parts: first, a short description for early adopters who want to try out this engine, and second, a description targeted at Stanbol developers interested in the internals of this new engine.

(1) For Users:

Feature-wise this engine is very similar to the TaxonomyLinkingEngine, but there are some improvements. First and most important, it has support for multiple languages. During development this engine was heavily tested with English and German texts. I have also run some tests with Spanish, French and Italian news articles, and the results looked fine. However, I do not speak those languages, so it is hard for me to validate the results.

To give this new engine a try:

* Start the Full Launcher. This already includes the KeywordExtractionEngine.
* Go to http://localhost:8080/system/console/components and search for "KeywordLinkingEngine".
* Click on the configuration button and add the name of a ReferencedSite in the first line. For a first test you can use the ReferencedSite "dbpedia" initialized by the default configuration of the Launcher. Note, however, that this only includes the 40k most famous Entities of Wikipedia, so enhancement results will be limited.
* To test this engine it is best to deactivate most of the other engines. However, make sure that the "LangIdEnhancementEngine" is active, because it extracts the language of parsed texts.
* Go to http://localhost:8080/engines and copy+paste some texts.

If you would like to test this with a bigger DBpedia index, you can download one from [2]. In any case you need to copy the **.solrindex.zip file into "{stanbol-launcher-root}/sling/datafiles" and then install the "org.apache.stanbol.data.site.**.1.0.0.jar" in the Bundle tab of the Apache Felix Webconsole (http://localhost:8080/system/console/bundles). If you install the dbpedia index, you should also first stop/uninstall the "org.apache.stanbol.data.sites.dbpedia.default" bundle.

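For those who prefer scripting over copy+paste in the browser, the manual test at http://localhost:8080/engines can probably be automated. The following is only a minimal Python sketch, not part of the engine: it assumes that the /engines endpoint accepts a plain-text POST and returns RDF in the format named by the Accept header. The function names build_enhance_request and enhance are made up for this illustration.

```python
import urllib.request

# URL taken from the steps above; adjust it if your launcher runs elsewhere.
ENGINES_URL = "http://localhost:8080/engines"


def build_enhance_request(text, url=ENGINES_URL, accept="text/turtle"):
    """Build (but do not send) a POST request carrying the text to enhance.

    Assumption: the endpoint takes the raw text as the request body and
    selects the RDF serialization based on the Accept header.
    """
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": accept,
        },
        method="POST",
    )


def enhance(text, url=ENGINES_URL, accept="text/turtle", timeout=30):
    """Send the text to a running launcher and return the response body."""
    request = build_enhance_request(text, url, accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the launcher running, calling enhance("Paris is the capital of France.") should return the enhancement results as text, provided the assumptions above hold for your installation.
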
The dbpedia index at [2] contains labels for en, de, it, es, fr and es. The default index that comes with the launcher also includes ar, da, fi, no, pt, ru, sv, tr, zh. For those who would like to test German-language text, I suggest also installing the

In the coming days Andreas Gruber will release some blog posts providing a much more detailed user-level introduction to this new engine - stay tuned.

(2) For Developers:

The main motivation for this re-implementation was to make it more modular (see also STANBOL-303 [1]). The current version focuses on separating the NLP part and the EntityLookup part from the implementation of the extraction process. The matching of the Labels with the text and the post-processing of matched Entities (e.g. following redirects, calculating confidences) are still part of the extraction process, but could also be externalized via additional interfaces.

The technical documentation of this engine will be available at [3].

Feedback welcome!

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-303
[2] http://dev.iks-project.eu/downloads/stanbol-indices/
[3] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html

--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
