Re: Update on the CELI enhancement engines (STANBOL-583) - FIXED PROBLEMS WITH FRENCH NER AND ADDED ITALIAN NER

Rupert Westenthaler Wed, 30 May 2012 09:28:39 -0700

Hi Alessio

On 30.05.2012, at 16:21, Alessio Bosca wrote:


> Hi Rupert, Stanbol community,
> 
> we are happy to finally announce that we have fixed the problem with the NER 
> engine described in the comments on STANBOL-583, and have also added support 
> for Italian Named Entity Recognition.
> I've posted a patch at 
> https://issues.apache.org/jira/browse/STANBOL-583?focusedCommentId=13285682#comment-13285682
>  and have also attached it to this mail.

Cool. That should mean that I can finalize the Engines tomorrow and merge the 
CELI engines back into trunk.

> Concerning Rupert's comment on the purpose of the TextAnnotation added 
> by the Lemmatizer component when the "completeMorphoAnalysis" option is 
> deactivated: in that case the component does not provide a token-by-token 
> morphological analysis; instead it returns the lemmatized version of the 
> whole textual content, replacing each textual token with its lemma form.
> E.g. "I'm booking two tickets" -> "I be book two ticket"

OK, that makes sense. I could not work out the meaning from the example 
used by the unit test. That is not so easy if you do not speak the language 
of the example ^^
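
To make the whole-text mode concrete, here is a minimal sketch in Python (chosen only for brevity; it is not the engine's code). The `LEMMAS` table and the `lemmatize_text` function are hypothetical stand-ins: the real lemmatization is done by the remote CELI service, and this toy version only reproduces the example quoted above.

```python
# Hypothetical toy lemma table; the real lemmas come from the CELI service.
LEMMAS = {"I'm": "I be", "booking": "book", "tickets": "ticket"}


def lemmatize_text(text, lemmas=LEMMAS):
    """Replace every whitespace-separated token with its lemma.

    Tokens without an entry in the table are kept unchanged.
    """
    return " ".join(lemmas.get(token, token) for token in text.split())


print(lemmatize_text("I'm booking two tickets"))  # prints: I be book two ticket
```

The result is one lemmatized string for the whole content rather than one annotation per token, which is why this mode produces so few triples.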

> If you think that this feature is not useful I could remove it in order to 
> remove unnecessary configurations.
> Let me know

This is definitely useful and produces far fewer triples than when 
completeMorphoAnalysis is activated. So I would suggest keeping this feature.

best
Rupert

> Bests,
>    Alessio
> 
> On 05/19/2012 08:19 PM, Rupert Westenthaler wrote:
>> Hi Alessio, Stanbol community
>> 
>> Before I start, the current state of the things described in this Mail
>> can be found in the CELI Engine branch
>> 
>>     http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>> 
>> I made good progress on this issue this week. But most of the work was
>> not directly on the CELI engines but rather on making Stanbol ready
>> for the new Engines ^^
>> 
>> A lot of small things were not explicitly specified (e.g. language
>> annotations STANBOL-613; TopicEnhancements STANBOL-617). This is not a
>> big deal if there is only a single Engine that provides this feature,
>> but as soon as there are several, one needs to ensure compatibility to
>> give users more freedom when they configure their EnhancementChains.
>> 
>> These changes should ensure that users can easily use one/several/all
>> of the CELI Engines and - even more important - combine them with all
>> the existing Stanbol EnhancementEngines.
>> 
>> In addition I have added a new utility class that can be used in unit
>> tests for EnhancementEngines to validate the created Enhancements (see
>> STANBOL-612). The new EnhancementStructureHelper class is part of the
>> "o.a.s.enhancer.test" test module and is by now used by most of
>> the Stanbol Enhancement engines (including all CELI engines).
>> 
>> In the following I provide an overview of the changes and the
>> current state of the Engines.
>> 
>> (1) General Changes (valid for all Engines)
>> 
>> * Error Handling: EnhancementEngines MUST NOT catch exceptions that
>> influence EnhancementResults. Users can configure in EnhancementChains
>> whether an Engine is optional or required, and the EnhancementJobManager
>> needs to take care of this. If Engines do catch Exceptions, then the
>> EnhancementJobManager is missing the required information.
>> 
>> * Read/Write locks: EnhancementEngines that use "ENHANCE_ASYNC" need
>> to use read and write locks when accessing the ContentItem.
>> 
>> * HTTP clients: I changed the clients so that they no longer create
>> in-memory copies of the content and the enhancement results. I know
>> some users that send PDF documents with 100+ pages to Stanbol, and
>> for such cases it is good to avoid an in-memory copy of 100 pages as
>> an XML-escaped string.
>> 
>> * fise:selection-context: This property was missing, but it is critical
>> for re-finding the exact location of a TextAnnotation within
>> non-plain-text systems (e.g. the http://hallojs.org/annotate.html
>> demo). As the CELI services do not provide this, I added an
>> implementation that uses 50 characters before/after the selected text to
>> create the context.
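
The selection-context idea described in the last item can be sketched roughly as follows. This is only an illustration in Python, not the code in the branch: the function name `selection_context`, its parameters, and the plain character-offset clamping are assumptions, and the actual implementation may treat word boundaries differently.

```python
def selection_context(text, start, end, window=50):
    """Return the selected span plus up to `window` characters on each side.

    `start` and `end` are character offsets of the selection in `text`
    (end exclusive). The context is clamped to the bounds of the text.
    """
    if not 0 <= start <= end <= len(text):
        raise ValueError("selection offsets are outside the text")
    context_start = max(0, start - window)
    context_end = min(len(text), end + window)
    return text[context_start:context_end]


# Usage: a 5-character selection in the middle of a longer text.
text = "a" * 100 + "Paris" + "b" * 100
context = selection_context(text, 100, 105)
print(len(context))
```

A consumer can then search for the context string first and locate the selected text inside it, which is more robust than relying on raw offsets once markup has been added around the plain text.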
>> 
>> To make my changes easier to understand I added detailed inline NOTES
>> describing those changes to the CELI classification Engines. For the
>> other engines those notes are not present.
>> 
>> (2) Language Identification - READY : Annotates the language as
>> described by STANBOL-613. This even provides a confidence for the
>> detected language! Could even provide confidences for other languages
>> (currently not used).
>> 
>> (3) Lemmatizer - FUNCTIONAL :
>> 
>> I fully understand the "completeMorphoAnalysis" mode. However, I do
>> not understand what the TextAnnotation that is added when
>> "completeMorphoAnalysis" is deactivated would be used for.
>> 
>> NOTES
>> 
>>  * this engine uses two properties, "fise:hasLemmaForm" and
>> "fise:hasMorphologicalFeature", and morphological features are encoded
>> as "{KEY}={VALUE}" (e.g. "GENDER=FEM", "POS=NF", "NUMBER=PLU"). While
>> this is OK with me for getting things started, this is definitely
>> something that could be improved on.
>> 
>> * if "completeMorphoAnalysis" is activated, this Engine will create a
>> fise:TextAnnotation for every single word, resulting in 10 - 15
>> triples/word. So this Engine might cause trouble for long texts.
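
For consumers of the "{KEY}={VALUE}" encoding mentioned in the notes above, decoding could look like the following sketch. It is written in Python for brevity and is not part of Stanbol; the function name `parse_morpho_features` is hypothetical, and collecting values in lists (so that a repeated key is not lost) is an assumption about how the data might look.

```python
def parse_morpho_features(values):
    """Group "KEY=VALUE" strings into a dict mapping each key to its values."""
    features = {}
    for value in values:
        key, separator, feature = value.partition("=")
        if not separator or not key:
            raise ValueError("not a KEY=VALUE feature: %r" % (value,))
        features.setdefault(key, []).append(feature)
    return features


# Usage with the values quoted in the note above.
print(parse_morpho_features(["GENDER=FEM", "POS=NF", "NUMBER=PLU"]))
```

Having to split strings like this on the client side is one reason the encoding could be improved, for example by using a dedicated property per feature.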
>> 
>> 
>> (4) NER engine - NOT FUNCTIONAL
>> 
>> * The issues described in the last comment of STANBOL-583 [1] still persist.
>> 
>> If those are solved this engine should be ready to be used.
>> 
>> (5) Classification engine - FUNCTIONAL
>> 
>> I aligned this engine, the Zemanta engine and the topic engine to the
>> same enhancement model (see STANBOL-617). In order to do that I needed
>> to change some things:
>> 
>> One "<return>{classification}</return>" as returned by the CELI
>> service is now mapped to one fise:TopicEnhancement. The "label"
>> element is used as fise:entity-label of the topic and the
>> fise:entity-reference is set to the most specific dbpedia ontology
>> class referenced by the "label" element (see comments in the
>> ClassificationClientHTTP client for details).
>> I am not completely sure about those assumptions. So feedback on that
>> is highly welcome!
>> 
>> (6) TODOs:
>> 
>> I think the main thing is to get rid of the two bugs of the NER
>> engine. After that I think we can add the CELI engines to the Stanbol
>> code base.
>> 
>> best
>> Rupert Westenthaler
>> 
>> [1] 
>> https://issues.apache.org/jira/browse/STANBOL-583?focusedCommentId=13275235&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13275235
>> (why do permalinks to JIRA issues need to be so long ...)
> 
> 
> -- 
> *************************************
> Alessio Bosca, Ph.D.
> CELI s.r.l.
> Via San Quintino 31
> 10121 Torino
> Tel. +39 011.562.71.15
> Fax +39 011.506.40.86
> http://www.celi.it
> *************************************
> 
> <celiPatchNER.patch>
