Hi, I am writing this to the list because JIRA is less prominent and some interested people might otherwise miss this discussion.
There are some comments inline and a longer discussion of the pros and cons at the end.

---------- Forwarded message ----------
From: Florent ANDRE (JIRA) <[email protected]>
Date: Thu, Jul 28, 2011 at 9:22 PM
Subject: [jira] [Created] (STANBOL-303) EntityFetch engine
To: [email protected]

EntityFetch engine
------------------

                 Key: STANBOL-303
                 URL: https://issues.apache.org/jira/browse/STANBOL-303
             Project: Stanbol
          Issue Type: Improvement
          Components: Enhancer
            Reporter: Florent ANDRE

> Hi,
>
> I extracted the "entity fetching" related code from the taxonomylinking
> engine and created a new engine based on it.

What do you use as input for this engine? See (1).

> I also made query.addSelectedField() configurable via the Felix
> configuration.

+1

> This engine is runnable in the
> ServiceProperties.ORDERING_EXTRACTION_ENHANCEMENT position.

Entity lookup might need additional data that is currently not present in the metadata. See (2).

> I see 2 advantages of such an engine:
> 1) users can develop an extraction engine without thinking about entity
> retrieval

That is the biggest PRO argument for splitting up text analysis and entity lookup into two engines and using ContentItem.getMetadata() as the abstraction layer!

> 2) if this engine provides a helpful lib, entity fetching can easily be
> embedded into users' engines, limiting code duplication for entity
> fetching.

Providing a tailored API/library for entity fetching as typically needed by enhancement engines. See (3).

> Could it be an interesting engine for trunk?
> ++

First let me say that it was not my plan to keep all this functionality within the TaxonomyLinkingEngine. The idea was to start by implementing everything in a single class to validate the approach (performance and result wise) and, if the results were promising, to refactor the implementation into something more generic. In fact I have already started making the TextAnalysis part more generic.
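To make the proposed split concrete: below is a minimal, self-contained Java sketch of the two-engine idea, where a text analysis engine only writes annotations to the shared metadata and an entity lookup engine only reads them. All types in it (TwoEngineSketch, the TextAnnotation and ContentItem stand-ins, the mock entity index, the naive capitalization-based "NER") are invented for illustration and are not the Stanbol APIs; real engines implement Stanbol's EnhancementEngine interface and exchange RDF metadata on the ContentItem.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names are hypothetical stand-ins for Stanbol types. */
public class TwoEngineSketch {

    /** Stand-in for a TextAnnotation produced by the text analysis engine. */
    static final class TextAnnotation {
        final String selectedText;
        TextAnnotation(String selectedText) { this.selectedText = selectedText; }
    }

    /** Stand-in for a ContentItem: the text plus the shared metadata. */
    static final class ContentItem {
        final String text;
        final List<TextAnnotation> metadata = new ArrayList<>();
        ContentItem(String text) { this.text = text; }
    }

    /** Engine 1: text analysis only. Writes TextAnnotations to the metadata
     *  (here a naive "NER" that selects capitalized tokens). */
    static void textAnalysisEngine(ContentItem ci) {
        for (String token : ci.text.split("\\s+")) {
            if (Character.isUpperCase(token.charAt(0))) {
                ci.metadata.add(new TextAnnotation(token));
            }
        }
    }

    /** Engine 2: entity lookup only. Reads TextAnnotations from the metadata
     *  and resolves them against a (mock) entity index. */
    static Map<String, String> entityLookupEngine(ContentItem ci, Map<String, String> index) {
        Map<String, String> linked = new LinkedHashMap<>();
        for (TextAnnotation ta : ci.metadata) {
            String uri = index.get(ta.selectedText);
            if (uri != null) linked.put(ta.selectedText, uri);
        }
        return linked;
    }

    public static void main(String[] args) {
        ContentItem ci = new ContentItem("Paris is the capital of France");
        Map<String, String> index = new LinkedHashMap<>();
        index.put("Paris", "http://dbpedia.org/resource/Paris");
        textAnalysisEngine(ci);   // engine 1 populates the shared metadata
        Map<String, String> links = entityLookupEngine(ci, index);
        System.out.println(links); // prints {Paris=http://dbpedia.org/resource/Paris}
    }
}
```

The point of the sketch is only the contract: the two engines never call each other; the metadata on the ContentItem is the sole interface between them, which is exactly what makes independent development of extraction and lookup possible.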
Basically, this means building a simple API for text analysis based on OpenNLP that is tailored to the needs of enhancement engines. This will be part of the org.apache.stanbol.commons.opennlp bundle. As this issue notes, the same would make sense for the entity retrieval part. So a big +1 from my side.

(1) Text analysis results:

The amount of data resulting from text analysis varies greatly. If you use NER (Named Entity Recognition) you get only a limited number of results that can easily be converted to an RDF graph and added to the metadata of the ContentItem. However, if you want to use words, POS tagging and a chunker, the amount of resulting information is much higher, and encoding all of it as RDF and adding it to the metadata may have performance and usability implications. For a text with 2000 words one could expect about 20 TextEnhancements when using NER, but 200+ chunks and 500+ words with interesting POS tags. Performance wise this will make the processing of the metadata in follow-up engines slower, and it will also require some functionality - a post-processing engine - that filters out most of these enhancements before sending the results back to the user.

If both text analysis and entity lookup are done in the same engine, it is much easier to optimize. For example, the TaxonomyLinkingEngine processes the content sentence by sentence, so only the text analysis results of the current sentence need to be kept in memory, and TextAnnotations are only created for words/chunks that are actually linked to an entity.

(2) Using the taxonomy to improve text analysis results:

First tests (with English text) have shown that POS tagging works very well, but the performance of the chunker is questionable. In general, building chunks manually based on POS tags worked much better in most cases. Based on that, I assume that in most cases the best approach would be to:

1. use words and POS tags as input
2. build chunk proposals based on the POS tags
3. look up entities with all nouns of the proposed chunk; all such nouns would be optional (this was the reason for implementing STANBOL-297)
4. based on the returned entities, search for the best match in the surrounding text (even outside the proposed chunk)

However, to implement step 4 the entity fetching part would need access to the results of the word tokenizer (2000 TextAnnotations for a document with 2000 words).

(3) APIs for text analysis and entity fetching tailored to the requirements of EnhancementEngine developers:

Because of this, my conclusion was that it would be best to first work on APIs that ease the development of engines that

* need to analyze natural language text
* need to look up entities from the Entityhub

So in the case of the TaxonomyLinkingEngine there would still be only a single engine, but the amount of code would be greatly reduced because it could use the tailored APIs for text analysis and entity fetching. In addition, the NER engine (enhancer.engines.opennlp.ner) should also be changed to use the new text analysis API, and the NamedEntityTagging engine (enhancer.engines.entitytagging) should use the entity fetching API. Basically this means that these APIs would support the development of engines that do both text analysis and entity lookup, as well as engines that do only one of the two. I have planned to work on the text analysis part in the coming weeks; however, because I will be on vacation the whole of August, I would not expect immediate results ^^

(4) Improving the metadata infrastructure:

In my opinion the best solution would be to split up text analysis and entity fetching into separate engines. However, this would require improving the way metadata are handled by the Enhancer infrastructure. This would include:

* processing of content in chunks (e.g. pages, sections, sentences, ...) to reduce the amount of data kept at any one time. This would also improve the processing of big documents. Have you ever tried to send a PDF with 80+ pages to the Enhancer?
* filtering of enhancements, so that users do not get enhancements that are only interesting during the enhancement process (unless they explicitly request such intermediate results)

WDYT

Rupert

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
