[
https://issues.apache.org/jira/browse/STANBOL-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler updated STANBOL-583:
----------------------------------------
Attachment: STANBOL-583-celi-engines_20120423_rwesten.patch
NOTE: The originally attached zip archive was not a patch, but an archive of
the source tree. Because this adds a new EnhancementEngine I was still able to
correctly apply it by extracting the archive, copying to the /enhancer/engine
and removing all svn metadata.
Created a new patch that includes the following changes:
* Applied some minor changes necessary to compile with recent changes within
  the trunk.
* Dependencies
    * changed the dependency on Apache Commons HttpClient to the OSGi bundle
      version "httpclient-osgi"
    * removed the unused dependency on OpenNLP
    * there are now no embedded dependencies
* Logging
    * changed the logger API from Apache log4j to SLF4J, the logging framework
      used by Apache Stanbol
    * logging in the tests still uses log4j via SLF4J
TODOs/Questions:
1. Stanbol EnhancementEngines MUST support "offline mode": this ensures that no
connections to external services are made if Stanbol is started in offline mode
(-Dorg.apache.stanbol.offline.mode=true). EnhancementEngines that require an
external service then need to deactivate themselves. This is most easily
achieved by adding
@Reference
private OnlineMode onlineMode;
as the OnlineMode service will only be available if OfflineMode is deactivated.
You will also need to add
<dependency>
  <groupId>org.apache.stanbol</groupId>
  <artifactId>org.apache.stanbol.commons.stanboltools.offline</artifactId>
  <scope>provided</scope>
</dependency>
2. While all unit tests succeed, I noticed exceptions like
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
...
indicating that the character encoding used by the received data is not UTF-8.
In fact, the responses of the service do not specify any encoding:
<?xml version="1.0" ?><S:Envelope
xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><ns2:guessLanguageResponse
...
However, I think this is related to how the request data is processed by the
**ClientHTTP.java classes:
* The "doPost(..)" method returns a String and uses "UTF-8" for parsing the
  String from the received bytes. So far so good.
* The calling method then creates another ByteArrayInputStream for the
  returned String by using String.getBytes(). This creates the byte[]
  representation of the String using the platform encoding ("MAC Roman" in my
  case).
* This stream is then passed to SOAPPart#setContent(...). I assume that,
  because the XML string does not include an explicit charset, this
  implementation will use UTF-8 to parse the "MAC Roman" encoded byte sequence.
I would suggest changing the doPost(..) method to return the InputStream and
passing this stream directly to SOAPPart#setContent(...).
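To illustrate the mismatch described in point 2, here is a minimal, standalone
sketch. It is not taken from the patch: the class and method names are
hypothetical, and ISO-8859-1 is used only as a stand-in for a non-UTF-8
platform encoding such as "MAC Roman". It encodes a String with one charset
and decodes the resulting bytes with another, which is what happens when
String.getBytes() without an argument is followed by a UTF-8 parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    // Encodes the text with one charset and decodes the bytes with another.
    // Hypothetical helper, used only to make the mismatch visible.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // "citta" with a grave accent on the final a (non-ASCII character).
        String text = "citt\u00e0";

        // Explicit UTF-8 on both sides: the text survives the round trip.
        String same = recode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 -> UTF-8 intact: " + same.equals(text));

        // Stand-in for the platform default: bytes written as ISO-8859-1
        // are not valid UTF-8, so the accented character is lost.
        String broken = recode(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 -> UTF-8 intact: " + broken.equals(text));
    }
}
```

Passing an explicit charset to getBytes(...), or avoiding the intermediate
String altogether as suggested above, removes the dependency on the platform
encoding.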
3. I noticed that a **ClientHTTP instance is created for each request. I would
rather expect a single instance to be created during engine activation. Or am
I missing a good reason why it is better to create a new instance for each
enhancement request?
4. The ClassificationClientHTTP uses "ns2:label" and "ns2:score" to access the
data. This seems dangerous, as the prefixes may depend on the XML framework in
use and might change over time. I would suggest explicitly referring to the
namespace "http://linguagrid.org/v20110204/commons" instead.
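As a rough sketch of what point 4 could look like (not the actual
ClassificationClientHTTP code), the following uses the JDK DOM API with a
namespace-aware parser and looks elements up by namespace URI and local name.
The class name, the helper method and the sample XML are assumptions made for
illustration; only the namespace URI and the element names "label" and "score"
come from the discussion above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class NamespaceLookup {

    static final String COMMONS_NS = "http://linguagrid.org/v20110204/commons";

    // Returns the text of the first element with the given local name in the
    // commons namespace, or null if there is none. Hypothetical helper.
    static String firstText(String xml, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Without this the parser ignores namespaces and the NS lookup fails.
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagNameNS(COMMONS_NS, localName);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
    }

    public static void main(String[] args) throws Exception {
        // Two made-up responses that differ only in the prefix they use.
        String withNs2 = "<r xmlns:ns2=\"" + COMMONS_NS + "\">"
                + "<ns2:label>Sport</ns2:label><ns2:score>0.9</ns2:score></r>";
        String withOther = "<r xmlns:c=\"" + COMMONS_NS + "\">"
                + "<c:label>Sport</c:label><c:score>0.9</c:score></r>";

        System.out.println(firstText(withNs2, "label"));
        System.out.println(firstText(withOther, "label"));
        System.out.println(firstText(withOther, "score"));
    }
}
```

Because the lookup keys on the namespace URI, both sample documents are read
the same way regardless of which prefix the serializer happened to choose.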
Alessio Bosca, can you please
1. validate that my changes work with the current trunk
2. check that my changes to the dependencies do not break the engines
3. add support for offline mode
4. have a look at the character encoding issues I encountered
On my TODO list:
1. validation of the created RDF (TextAnnotations, EntityAnnotations,
TopicAnnotations)
2. read/write locks on the ContentItem and the metadata (as you return
"ENHANCE_ASYNC" in the canEnhance(..) method this is necessary)
3. testing the Engines on a Stanbol instance within a real EnhancementChain.
> CELI enhancement engine(s) - Contribution to stanbol
> -----------------------------------------------------
>
> Key: STANBOL-583
> URL: https://issues.apache.org/jira/browse/STANBOL-583
> Project: Stanbol
> Issue Type: New Feature
> Components: Enhancer
> Affects Versions: 0.9.0-incubating
> Environment: Enhancement Engines developed as web service clients
> Reporter: Alessio Bosca
> Priority: Minor
> Labels: patch
> Fix For: 0.9.0-incubating
>
> Attachments: STANBOL-583-celi-engines_20120423_rwesten.patch, celi.zip
>
>
> The services included so far in the module as Enhancement Engines are:
> - a Named Entity Recognition service for French
> - a Lemmatizer for Italian, German, Romanian, Russian, Danish (it creates an
> annotation on the document whose content is the lemmatized form of the
> document)
> - a Language Identifier for Italian, French, German, Spanish, Portuguese,
> Polish, Hungarian, Dutch, Swedish, Arabic, Russian, Turkish, Romanian, Greek,
> Norwegian
> - a Document Classification service for Italian, French, German, English,
> Spanish, Portuguese that associates a document with DBpedia classes