Hi Jairo,

thanks for your feedback regarding the disambiguation engine.
On Fri, Nov 9, 2012 at 6:51 PM, Jairo Sarabia <jairo.sara...@appstylus.com> wrote:
> I'm Jairo Sarabia, a web developer at Notedlinks S.L. from Barcelona
> (Spain).
> We're very interested in Apache Stanbol and we would like to know how
> Stanbol works internally: how the framework is used, the directory
> structure, and how the configuration files work.
> Is there any documentation about this? Could you send it to me?

For the Stanbol Enhancer there is developer-level documentation available. http://stanbol.apache.org/docs/trunk/components/enhancer/ is the starting point. The section "Main Interfaces and Utility Classes" links to the descriptions of the different components.

> Meanwhile, we want to thank and congratulate you, because we tested the
> disambiguation engine and we liked the improved responses in English,
> although I understand that the quality is still mixed in some respects.
> Especially with Person and Organization topics, most of the time it only
> detects part of the name, especially in compound words, and this makes
> the disambiguation wrong.

This is probably because the disambiguation engine does not refine the fise:selected-text of the fise:TextAnnotation based on the disambiguation results. Can you provide some examples of this behavior so that I can validate this assumption?

> We would like to know about future plans for the disambiguation engine,
> and whether it can be used for other languages.

Stanbol is a community-driven project. The engine itself was developed by Kritarth Anand in a GSoC project [1] and contributed to Stanbol with STANBOL-723 [2]. I was mentoring this project. I do not know Kritarth's plans, but personally I plan to continue work on this engine as soon as I have finished re-integrating the Stanbol NLP module with the trunk. This work will mainly focus on making the MLT disambiguation engine configurable and testing that it works well with the new Stanbol NLP processing module (STANBOL-733).
[1] http://www.google-melange.com/gsoc/project/google/gsoc2012/kritarth/12001
[2] https://issues.apache.org/jira/browse/STANBOL-723

> Finally, we would like to know if it is possible to create multilingual
> DBpedia indexes so that the responses link to the DBpedia in the language
> of the text. For example, if the text is in Spanish, then the literals
> found would relate to resources of the Spanish DBpedia (not English
> DBpedia resources).
> And if it is possible, could you explain to me how to do it?

The disambiguation-mlt engine is not language specific. In principle it works with any Entityhub Site and any language for which a disambiguation context is available. AFAIK the currently hard-coded configuration uses the full-text field (which contains texts in all languages) for the Solr MLT query. The 1 GByte Solr index you probably use for disambiguation includes short abstracts only for English; long abstracts are not included for any language. This is also the reason why you are not getting disambiguation results for languages other than English. A better-suited environment would provide short (or even long) abstracts for the language you want to disambiguate. The configuration of the engine would then not use the all-language full-text field for the MLT queries, but the language-specific one instead.

The reason why such information is not included in the distributed index is simply to reduce its size. In addition, when that index was created there was not yet an engine such as the disambiguation-mlt one that would have consumed this information.

I have already created a DBpedia 3.8 based index that includes a lot of information useful for disambiguation in several languages. However, this index in its current form is not easily shared, as it is about ~100 GByte (45 GByte compressed) in size. In addition, I have not yet had time to validate the index (as indexing only completed shortly before I left for ApacheCon last week).
Anyway, I will use this index as the base for further work on the disambiguation-mlt engine. I will also share the Entityhub indexing tool configuration that was used, and try to come up with a modified configuration that is about 10 GByte in size but still useful for disambiguation with the MLT based engine.

best
Rupert

> That's all! And thank you very much again!
>
> Best,
>
> Jairo Sarabia

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen