Hi all,

Earlier today I committed the first version of the
KeywordLinkingEngine. This is basically a re-implementation of the
TaxonomyLinkingEngine I presented during the Paris IKS Community
Workshop.

This mail has two parts: First a short description for Early Adopters
that want to try out this engine and second a description targeted at
Stanbol developers interested in the internals of this new engine.

(1) For Users:

Feature-wise, this engine is very similar to the TaxonomyLinkingEngine;
however, there are some improvements.

First and most important, it has support for multiple languages. During
development this engine was heavily tested with English and German
texts. I have also made some tests with Spanish, French and Italian
news articles, and the results looked fine. However, I do not speak
those languages, so it is hard for me to validate the results.

To give this new engine a try:

* start the Full Launcher. This already includes the KeywordExtractionEngine
* go to http://localhost:8080/system/console/components and search for
"KeywordLinkingEngine".
* click on the configuration button and add the name of a
referencedSite in the first line. For a first test you can use the
ReferencedSite "dbpedia" initialized by the default configuration of
the Launcher. However, note that this only includes the 40k most famous
Entities of Wikipedia, so enhancement results will be limited.
* To test this engine it is best to deactivate most of the other
engines. However, make sure that the "LangIdEnhancementEngine" is
active, because it extracts the language of parsed texts.
* go to http://localhost:8080/engines and copy+paste some texts.
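
The last step can also be scripted instead of using copy+paste in the
browser. The following is only a minimal sketch: it assumes the
Launcher runs on localhost:8080 as above and that the /engines
endpoint accepts a POST with plain text and returns RDF. The function
names build_enhance_request and enhance are made up for this example.

```python
import urllib.request

# Assumption: the Full Launcher runs locally on the port used in this mail.
ENGINES_URL = "http://localhost:8080/engines"


def build_enhance_request(text, url=ENGINES_URL, accept="application/rdf+xml"):
    """Build (but do not send) a POST request carrying plain text.

    Assumption: the endpoint takes the text as the request body and
    selects the RDF serialization of the result via the Accept header.
    """
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": accept,
        },
        method="POST",
    )


def enhance(text, **kwargs):
    """Send the text to the running Launcher and return the response body."""
    request = build_enhance_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage (needs a running Launcher):
#   print(enhance("Paris is the capital of France."))
```

Building the request separately from sending it makes it possible to
inspect the headers and body without a running server.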

If you would like to test this with a bigger DBpedia index you can
download one from [2]. In any case you need to copy the **.solrindex.zip
file into the "{stanbol-launcher-root}/sling/datafiles" directory and
then install the "org.apache.stanbol.data.site.**.1.0.0.jar" in the
Bundle tab of the Apache Felix Webconsole
(http://localhost:8080/system/console/bundles). If you install
the dbpedia index you should first stop/uninstall the
"org.apache.stanbol.data.sites.dbpedia.default" bundle.
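
The copy step can be sketched as a small helper. This only covers
placing the archive in the datafiles directory; installing the bundle
still happens in the Webconsole as described above. The function name
install_index_archive and the paths in the usage comment are
placeholders, not part of Stanbol.

```python
import shutil
from pathlib import Path


def install_index_archive(archive, launcher_root):
    """Copy a *.solrindex.zip archive into <launcher_root>/sling/datafiles.

    Returns the path of the copied file. The datafiles directory is
    created if it does not exist yet (an assumption of this sketch).
    """
    archive = Path(archive)
    if not archive.name.endswith(".solrindex.zip"):
        raise ValueError(f"not a solrindex archive: {archive.name}")
    datafiles = Path(launcher_root) / "sling" / "datafiles"
    datafiles.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(archive, datafiles))


# Usage (placeholder paths):
#   install_index_archive("downloads/dbpedia.solrindex.zip", "stanbol-launcher")
```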

The dbpedia index at [2] contains labels for en, de, it, es and fr.
The default index that comes with the launcher also includes ar,
da, fi, no, pt, ru, sv, tr, zh.

For those who would like to test German-language text I suggest also
installing the

In the coming days Andreas Gruber will release some blog posts
providing a much more detailed user level introduction to this new
engine - stay tuned.

(2) For Developers:

The main motivation for this re-implementation was to make it more
modular (see also STANBOL-303 [1]).
The current version focuses on the separation of the NLP part and the
EntityLookup part from the implementation of the extraction process.
The matching of the Labels with the text and the post-processing of
matched Entities (e.g. following redirects, calculating confidences)
are still part of the extraction process, but could also be
externalized via additional interfaces.

The technical documentation of this engine will be available at [3].

Feedback welcome!

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-303
[2] http://dev.iks-project.eu/downloads/stanbol-indices/
[3] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html

-- 
| Rupert Westenthaler                            [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
