Hi Alessio

On 30.05.2012, at 16:21, Alessio Bosca wrote:
> Hi Rupert, Stanbol community,
>
> we are happy to finally announce that we have fixed the problem with the NER
> engine discussed in STANBOL-583 and have also added support for Italian Named
> Entity Recognition.
> I've posted a patch on
> https://issues.apache.org/jira/browse/STANBOL-583?focusedCommentId=13285682#comment-13285682
> however I also included it in the attachments.

Cool. That should mean that I can finalize the Engines tomorrow and merge the CELI engines back into trunk.

> Concerning Rupert's comment on the purpose of the TextAnnotation added
> by the Lemmatizer component if the "completeMorphoAnalysis" option is deactivated:
> in that case the component doesn't provide a morphological analysis token by
> token; instead it returns the lemmatized version of the whole textual content,
> replacing each textual token with its lemma form.
> I.e. I'm booking two tickets -> I be book two ticket

OK that makes sense. I had not understood the meaning based on the example used by the Unit Test. Not as easy if you do not speak the language of the example ^^

> If you think that this feature is not useful I could remove it in order to
> remove unnecessary configurations.
> Let me know

This is definitely useful and produces far fewer triples than when completeMorphoAnalysis is activated. So I would suggest keeping this feature.

best
Rupert

> Bests,
> Alessio
>
> On 05/19/2012 08:19 PM, Rupert Westenthaler wrote:
>> Hi Alessio, Stanbol community
>>
>> Before I start, the current state of the things described in this Mail
>> can be found in the CELI Engine branch
>>
>> http://svn.apache.org/repos/asf/incubator/stanbol/branches/celi-enhancement-engines/
>>
>> I made good progress on this issue this week. But most of the work was
>> not directly on the CELI engines but rather on making Stanbol ready
>> for the new Engines ^^
>>
>> A lot of small things were not explicitly specified (e.g. language
>> annotations STANBOL-613; TopicEnhancements STANBOL-617).
>> This is not a
>> big deal if there is only a single Engine that provides this feature,
>> but as soon as there are multiple ones, compatibility needs to be ensured to
>> give users more freedom when they configure their EnhancementChains.
>>
>> These changes should ensure that users can easily use one/several/all
>> of the CELI Engines and - even more important - combine them with all
>> the existing Stanbol EnhancementEngines.
>>
>> In addition I have added a new Utility class that can be used in Unit
>> tests for EnhancementEngines to validate the created Enhancements (see
>> STANBOL-612). The new EnhancementStructureHelper class is part of the
>> "o.a.s.enhancer.test" test module and is by now used by most of
>> the Stanbol Enhancement engines (including all CELI engines).
>>
>> In the following I provide an overview of the changes and the
>> current state of the Engines
>>
>> (1) General Changes (valid for all Engines)
>>
>> * Error Handling: An EnhancementEngine MUST NOT catch exceptions that
>> influence EnhancementResults. Users can configure in EnhancementChains
>> whether an Engine is optional or required, and the EnhancementJobManager
>> needs to take care of this. If Engines do catch Exceptions then the
>> EnhancementJobManager is missing the required information.
>>
>> * Read/Write locks: EnhancementEngines that use "ENHANCE_ASYNC" need
>> to use read and write locks when accessing the ContentItem.
>>
>> * HTTP clients: I changed the clients so that they no longer create
>> in-memory copies of the content and the enhancement results. I know
>> some users that send PDF documents with 100+ pages to Stanbol and
>> for such cases it is good to avoid an in-memory copy of a 100-page
>> XML-escaped string.
>>
>> * fise:selection-context: This property was missing but it is critical
>> for re-finding the exact location of a TextAnnotation within
>> non-plain-text systems (e.g. the http://hallojs.org/annotate.html
>> demo).
>> As the CELI services do not provide this I added an
>> implementation that uses 50 chars before/after the selected text to
>> create the context.
>>
>> To make my changes easier to understand I added detailed inline NOTES
>> describing those changes to the CELI classification Engines. For the
>> other engines those notes are not present.
>>
>> (2) Language Identification - READY : Annotates the language as
>> described by STANBOL-613. This even provides a confidence for the
>> detected language! Could even provide confidences for other languages
>> (currently not used).
>>
>> (3) Lemmatizer - FUNCTIONAL :
>>
>> I do fully understand the "completeMorphoAnalysis" mode. However I do
>> not understand for what one would use the TextAnnotation added if
>> "completeMorphoAnalysis" is deactivated.
>>
>> NOTES
>>
>> * this engine uses two properties "fise:hasLemmaForm" and
>> "fise:hasMorphologicalFeature" and morphological features are encoded
>> as "{KEY}={VALUE}" (e.g. "GENDER=FEM", "POS=NF", "NUMBER=PLU"). While
>> this is OK with me for getting things started this is definitely
>> something that could be improved on.
>>
>> * if "completeMorphoAnalysis" is activated this Engine will create a
>> fise:TextAnnotation for each single word, resulting in 10 - 15
>> triples/word. So this Engine might cause trouble for long texts.
>>
>>
>> (4) NER engine - NOT FUNCTIONAL
>>
>> * The issues described in the last comment of STANBOL-583 [1] still persist.
>>
>> If those are solved this engine should be ready to be used.
>>
>> (5) Classification engine - FUNCTIONAL
>>
>> I aligned this engine, the Zemanta engine and the topic engine to the
>> same enhancement model (see STANBOL-617). In order to do that I needed
>> to change some things:
>>
>> One "<return>{classification}</return>" as returned by the CELI
>> service is now mapped to one fise:TopicEnhancement.
>> The "label"
>> element is used as fise:entity-label of the topic and the
>> fise:entity-reference is set to the most specific dbpedia ontology
>> class referenced by the "label" element (see comments in the
>> ClassificationClientHTTP client for details).
>> I am not completely sure about those assumptions. So feedback on that
>> is highly welcome!
>>
>> (6) TODOs:
>>
>> I think the main thing is to get rid of the two bugs of the NER
>> engine. After that I think we can add the CELI engines to the Stanbol
>> code base.
>>
>> best
>> Rupert Westenthaler
>>
>> [1]
>> https://issues.apache.org/jira/browse/STANBOL-583?focusedCommentId=13275235&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13275235
>> (why need perma links to jira issues be so long ...)
>
>
> --
> *************************************
> Alessio Bosca, Ph.D.
> CELI s.r.l.
> Via San Quintino 31
> 10121 Torino
> Tel. +39 011.562.71.15
> Fax +39 011.506.40.86
> http://www.celi.it
> *************************************
>
> <celiPatchNER.patch>
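
For the fise:selection-context point in the quoted mail, here is a minimal sketch of how a context of 50 characters before/after the selected text could be computed. The class and method names (SelectionContextSketch, selectionContext) are hypothetical and not taken from the CELI engines or the Stanbol code base; the actual implementation may well differ, e.g. by cutting at word boundaries.

```java
public class SelectionContextSketch {

    // Characters kept on each side of the selection (the value mentioned in the mail).
    public static final int CONTEXT_CHARS = 50;

    /**
     * Returns the selected text [start, end) together with up to CONTEXT_CHARS
     * characters before and after it, clamped to the bounds of the text.
     * Hypothetical helper for illustration only.
     */
    public static String selectionContext(String text, int start, int end) {
        if (start < 0 || end > text.length() || start > end) {
            throw new IllegalArgumentException(
                "invalid selection: " + start + ".." + end);
        }
        int from = Math.max(0, start - CONTEXT_CHARS);
        int to = Math.min(text.length(), end + CONTEXT_CHARS);
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String text = "CELI s.r.l. is located in Torino and provides "
            + "language identification, lemmatization and NER services.";
        int start = text.indexOf("Torino");
        int end = start + "Torino".length();
        // Prints the selection plus its surrounding context.
        System.out.println(selectionContext(text, start, end));
    }
}
```

Clamping to the text bounds keeps the helper safe for selections near the start or end of the content, where fewer than 50 characters are available on one side.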
