Hi Iavor,

If you check out the branch

    http://svn.apache.org/repos/asf/incubator/stanbol/branches/dbpedia-spotlight-engines/

you will find the merged version of the DBpedia Spotlight EnhancementEngines under

   engines/dbpedia-spotlight/

With revision http://svn.apache.org/viewvc?rev=1376912&view=rev I have made the following changes

* moved all Engines into a single Module
* Parameters are now shared between all of them
* The Domain Model (annotation, surfaceform, candidates) is shared
* added utility methods for writing enhancements

All those changes were mainly restructuring of code and removing duplicates.

In addition to that I have also made some changes

* Requests/Responses to the RESTful services are now handled differently to avoid creating in-memory copies of request/response data. However NOTE that the request data (the text of the contentItem) is still held in memory twice (the text and its URL encoded version). This cannot easily be avoided as long as "application/x-www-form-urlencoded" is used to communicate with the server.
* Added code to the "Annotation" class that extracts the most generic dbpedia-ont Class from the types. This code makes a lot of assumptions and NEEDS to be validated (see comments in the class).
* Added functionality to extract the fise:selection-context for created fise:TextAnnotation (this was a TODO in the contributed version)
* Added unit tests that validate the written Enhancements for each of the Engines
* Ensured that the engines are deactivated if Stanbol runs in OfflineMode (by adding a @Reference to OnlineMode)
* Added a default configuration for a "dbpedia-spotlight" EnhancementChain that is automatically deployed with the bundle. This uses "metaxa;optional, tika;optional, langdetect, dbpspotlightannotate".
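
To make the "most generic dbpedia-ont Class" step above easier to discuss, here is a minimal sketch of one possible heuristic. It is not the code in the "Annotation" class. The class name, the method name, the "DBpedia:" prefix and above all the ordering of the types (most specific first, most generic last) are assumptions that would need to be checked against real Spotlight responses.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not the actual Stanbol "Annotation" code. */
final class GenericTypeSketch {

    // ASSUMPTION: dbpedia-ont classes appear with this prefix in the types value.
    static final String DBPEDIA_PREFIX = "DBpedia:";

    /**
     * Returns the local name of the LAST "DBpedia:"-prefixed entry of a
     * comma-separated types string, or empty if there is none.
     * ASSUMPTION: types are listed from most specific to most generic,
     * so the last dbpedia-ont entry is taken as the most generic one.
     */
    static Optional<String> mostGenericDbpediaType(String types) {
        if (types == null) {
            return Optional.empty();
        }
        String result = null;
        for (String type : types.split(",")) {
            String trimmed = type.trim();
            // skip entries from other vocabularies and a bare prefix without a name
            if (trimmed.startsWith(DBPEDIA_PREFIX)
                    && trimmed.length() > DBPEDIA_PREFIX.length()) {
                result = trimmed.substring(DBPEDIA_PREFIX.length());
            }
        }
        return Optional.ofNullable(result);
    }
}
```

If the ordering turns out to be the other way round, only the choice of "last" versus "first" matching entry would change. If there is no reliable ordering at all, the class hierarchy of the ontology would have to be consulted instead.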

In addition, STANBOL-717 solves the default configuration issue.

Open Issues (sorted by importance)

1. Determine the "dc:type" value for TextAnnotation. Currently I try to use the most generic dbpedia-ont class. First, I am not sure if this is a good idea, and second, the code for extracting this type (see above) makes a lot of assumptions.
2. Data for suggested Entities: Created fise:EntityAnnotations do not have the correct "fise:entity-label", but use the SurfaceForm instead. Also the "fise:entity-type" values (the rdf:type values of the suggested Entity) sometimes seem to diverge from the list returned by dbpedia.org. Can Spotlight provide Entity data? If not, I was thinking about a "dereference entity" option that downloads the entity data from dbpedia.org instead of using the information within the spotlight response.
3. Is it possible for the annotate and disambiguate Engines to return multiple suggestions for a fise:TextAnnotation (spotted Entity)? Stanbol Enhancer users are used to getting multiple suggestions. So even though "disambiguation" does re-rank suggestions, there is no harm if multiple are returned.
4. Spotter: There are several different possibilities (NER, LingPipeSpotter, OpenNLPChunkerSpotter and Kea). I was thinking of including those as preconfigured options (human-readable name and description) instead of a simple String field.
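
For item 4, the preconfigured options could look roughly like the sketch below. The enum name, its fields and the lookup method are hypothetical. The parameter values are simply the names listed above and may differ from what the Spotlight service actually expects. The descriptions are placeholders.

```java
import java.util.Optional;

/** Hypothetical set of preconfigured spotter options; names and texts are placeholders. */
enum SpotterOption {
    NER("NER", "NER", "Spotting based on named entity recognition"),
    LING_PIPE("LingPipeSpotter", "LingPipe", "Spotting based on LingPipe"),
    OPEN_NLP_CHUNKER("OpenNLPChunkerSpotter", "OpenNLP Chunker",
            "Spotting based on the OpenNLP chunker"),
    KEA("Kea", "Kea", "Spotting based on Kea");

    private final String value;       // ASSUMPTION: value sent as the spotter parameter
    private final String label;       // human-readable name shown in the configuration
    private final String description; // short explanation shown in the configuration

    SpotterOption(String value, String label, String description) {
        this.value = value;
        this.label = label;
        this.description = description;
    }

    String getValue() { return value; }

    String getLabel() { return label; }

    String getDescription() { return description; }

    /** Looks up an option by its parameter value; empty for unknown or null values. */
    static Optional<SpotterOption> fromValue(String value) {
        for (SpotterOption option : values()) {
            if (option.value.equals(value)) {
                return Optional.of(option);
            }
        }
        return Optional.empty();
    }
}
```

Compared to a plain String field, an unknown value can then be rejected when the engine is configured, rather than failing later when the service is called.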

best
Rupert