[jira] [Updated] (STANBOL-583) CELI enhancement engine(s) - Contribution to stanbol

Rupert Westenthaler (JIRA) Mon, 23 Apr 2012 01:26:28 -0700

     [ 
https://issues.apache.org/jira/browse/STANBOL-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rupert Westenthaler updated STANBOL-583:
----------------------------------------

    Attachment: STANBOL-583-celi-engines_20120423_rwesten.patch

NOTE: The originally attached zip archive was not a patch, but an archive of 
the source tree. Because it adds a new EnhancementEngine, I was still able to 
apply it correctly by extracting the archive, copying it to /enhancer/engine 
and removing all svn metadata.


I created a new patch that includes the following changes:

* Applied some minor changes necessary to compile with recent changes within 
the trunk.
* Dependencies
    * changed dependencies of the Apache commons httpclient to the OSGI bundle 
version "httpclient-osgi"
    * removed the unused dependency to OpenNLP
    * now there are no embedded dependencies
* Logging
    * changed Logger API from Apache log4j to SLF4J - the logging Framework 
used by Apache Stanbol. 
    * Logging in the tests still uses log4j via SLF4J

TODOs/Questions:

1. Stanbol EnhancementEngines MUST support "offline mode": this ensures that no 
connections to external services are made if Stanbol is started in offline mode 
(-Dorg.apache.stanbol.offline.mode=true). EnhancementEngines that require an 
external service then need to deactivate themselves. This is most easily 
achieved by adding

    @Reference
    private OnlineMode onlineMode;

as the OnlineMode service will only be available if OfflineMode is deactivated. 

You will also need to add

    <dependency>
        <groupId>org.apache.stanbol</groupId>
        <artifactId>org.apache.stanbol.commons.stanboltools.offline</artifactId>
        <scope>provided</scope>
    </dependency>

2. While all unit tests succeed I noticed exceptions like

    com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
        at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:684)
        ...

indicating that the char encoding used by the received data is not UTF-8. In 
fact, the responses of the service do not specify any encoding:

    <?xml version="1.0" ?><S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><ns2:guessLanguageResponse
 ...

However, I think this is related to how the received data is processed by the 
**ClientHTTP.java classes:

* the "doPost(..)" method returns a String and uses "UTF-8" for parsing the 
String from the received bytes. So far so good.
* the calling method then creates another ByteArrayInputStream for the 
returned String by using String.getBytes(). This creates the byte[] 
representation of the String using the platform encoding ("MAC Roman" in my 
case).
* This stream is then passed to SOAPPart#setContent(...). I assume that, 
because the XML string does not include an explicit charset, the 
implementation will use UTF-8 to parse the "MAC Roman" encoded byte sequence.

I would suggest changing the doPost(..) method to return the InputStream and 
passing this stream directly to SOAPPart#setContent(...).
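
The effect can be reproduced without the SOAP stack. The following is only a 
minimal sketch, not code from the patch: the class and method names are 
hypothetical, ISO-8859-1 stands in for a non-UTF-8 platform encoding, and a 
strict UTF-8 decoder stands in for the XML parser.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    /** Strictly decodes bytes as UTF-8; returns null if they are not valid UTF-8. */
    static String decodeUtf8Strict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Mimics the described flow: bytes -> String (UTF-8) -> bytes (other charset). */
    static byte[] roundTrip(byte[] utf8Response, Charset reEncodeWith) {
        String asString = new String(utf8Response, StandardCharsets.UTF_8);
        return asString.getBytes(reEncodeWith);
    }

    public static void main(String[] args) {
        // Hypothetical response fragment containing one non-ASCII character.
        byte[] response = "<label>caf\u00e9</label>".getBytes(StandardCharsets.UTF_8);

        byte[] reEncodedLatin1 = roundTrip(response, StandardCharsets.ISO_8859_1);
        byte[] reEncodedUtf8 = roundTrip(response, StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 round trip is valid UTF-8: "
                + (decodeUtf8Strict(reEncodedLatin1) != null));
        System.out.println("UTF-8 round trip is valid UTF-8: "
                + (decodeUtf8Strict(reEncodedUtf8) != null));
    }
}
```

If the String round trip is kept, calling getBytes with an explicit UTF-8 
charset would avoid the platform dependency. SOAPPart#setContent takes a 
javax.xml.transform.Source, so handing over the stream directly would 
presumably mean wrapping it in a StreamSource.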

3. I noticed that a new **ClientHTTP instance is created for each request. I 
would rather expect a single instance to be created during the engine 
activation. Or am I missing a good reason why it is better to create a new 
instance for each enhancement request?

4. The ClassificationClientHTTP uses "ns2:label" and "ns2:score" to access the 
data. This seems dangerous, as the prefixes may depend on the XML framework in 
use and might change over time. I would suggest explicitly referring to the 
namespace "http://linguagrid.org/v20110204/commons" instead.
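
A prefix-independent lookup could look roughly like the sketch below. It uses 
only the JDK DOM API. The class name, the helper method and the sample 
documents are hypothetical; the real service response is wrapped in a SOAP 
envelope and is not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NamespaceLookup {

    static final String COMMONS_NS = "http://linguagrid.org/v20110204/commons";

    /** Returns the text of the first element with the given local name in COMMONS_NS, or null. */
    static String firstText(String xml, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for namespace-based lookups
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagNameNS(COMMONS_NS, localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Same content, two different prefixes bound to the same namespace (made-up samples).
        String withNs2 = "<r xmlns:ns2=\"" + COMMONS_NS + "\">"
                + "<ns2:label>Sport</ns2:label><ns2:score>0.8</ns2:score></r>";
        String withOtherPrefix = "<r xmlns:c=\"" + COMMONS_NS + "\">"
                + "<c:label>Sport</c:label><c:score>0.8</c:score></r>";

        System.out.println(firstText(withNs2, "label"));
        System.out.println(firstText(withOtherPrefix, "label"));
    }
}
```

Both sample documents yield the same label because the lookup matches on the 
namespace URI and the local name and never looks at the prefix.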

Alessio Bosca, can you please 

1. validate that my changes work with the current trunk
2. check that my changes to the dependencies do not break the engines
3. add support for Offline Mode
4. have a look at the char encoding issues I encountered

On my TODO list are

1. validation of the created RDF (TextAnnotations, EntityAnnotations, 
TopicAnnotations)
2. read/write locks on the ContentItem and the metadata (as you return 
"ENHANCE_ASYNC" in the canEnhance(..) method this is necessary)
3. testing the Engines on a Stanbol instance within a real EnhancementChain.
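
For the locking item, the general pattern is to read shared state under a 
read lock, do the slow remote call without holding any lock, and write the 
results back under the write lock. The sketch below uses only 
java.util.concurrent. LockedMetadata is a hypothetical stand-in for the 
metadata of a content item, not the Stanbol API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockedMetadata {

    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> annotations = new ArrayList<>();

    /** Returns a copy of the current annotations, taken under the read lock. */
    List<String> snapshot() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(annotations);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Adds results under the write lock so concurrent readers never see a partial update. */
    void addAll(List<String> results) {
        lock.writeLock().lock();
        try {
            annotations.addAll(results);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedMetadata metadata = new LockedMetadata();
        // Placeholder for results obtained from a remote call made outside any lock.
        List<String> fromService = Arrays.asList("TextAnnotation-1", "EntityAnnotation-1");
        metadata.addAll(fromService);
        System.out.println(metadata.snapshot());
    }
}
```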
                
> CELI enhancement engine(s)  - Contribution to stanbol
> -----------------------------------------------------
>
>                 Key: STANBOL-583
>                 URL: https://issues.apache.org/jira/browse/STANBOL-583
>             Project: Stanbol
>          Issue Type: New Feature
>          Components: Enhancer
>    Affects Versions: 0.9.0-incubating
>         Environment: Enhancement Engines developed as web service clients
>            Reporter: Alessio Bosca
>            Priority: Minor
>              Labels: patch
>             Fix For: 0.9.0-incubating
>
>         Attachments: STANBOL-583-celi-engines_20120423_rwesten.patch, celi.zip
>
>
> The services included so far in the module as Enhancement Engines are:
> - a Named Entity Recognition service for French
> - a Lemmatizer for Italian, German, Romanian, Russian, Danish (it creates an 
> annotation on the document whose content is the lemmatized form of the 
> document)
> - a Language Identifier for Italian, French, German, Spanish, Portuguese, 
> Polish, Hungarian, Dutch, Swedish, Arabic, Russian, Turkish, Romanian, Greek, 
> Norwegian
> - a Document Classification service for Italian, French, German, English, 
> Spanish, Portuguese that associates a document to DBPedia classes 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
