Hi!

On 5 June 2012 11:44, Rupert Westenthaler <[email protected]> wrote:
> Hi Mihály
>
> An integration between Stanbol and UIMA would indeed be something very 
> useful. I will try to provide some pointers - especially related to the 
> Stanbol Enhancer  - in this mail. But because my own experience with UIMA is 
> limited to reading the documentation about two years ago I will not be able 
> to provide much input on the UIMA side of the task.
>
> On 04.06.2012, at 18:45, Mihály Héder wrote:
>> Hello Everyone,
>>
>> I'm new to this list, my name is Mihály Héder ; I am the lead
>> developer of Sztakipedia project:
>> http://www.youtube.com/watch?v=8VW0TrvXpl4
>>
>> Most of Sztakipedia's suggestions are based on UIMA Annotation Chains,
>> which are composed of UIMA Annotation Engines. These are similar
>> to Enhancer Chains and Enhancement Engines, resp. If you are curious,
>> you can play around one of Sztakipedia's chains:
>> http://pedia.sztaki.hu:8080/tfidfengsb/?mode=form This is a
>> Tokenizer+Sentence boundary detector+lemmatizer+tf-idf calculator
>> chain (tf-idf is calculated on enwiki in this case)
>>
> [..]
>>
>> So right now I'm investigating how to integrate UIMA stuff into
>> Stanbol. After having read some Stanbol Docs and writing a Hello World
>> enhancement engine to get a grip on Stanbol, I think this is how it
>> should be done:
>> -An adapter-like interface is needed that glues together two
>> components. If you use UIMA, most of the time you just have a pear
>> file from a third party that you can't/do not want to modify. It will
>> have its own type system, chain definition, etc. Also, hopefully there
>> will be many more Stanbol users than developers in the long run.
>> -This means that the real use case is that the future user downloads a
>> UIMA chain from somewhere, downloads Stanbol, and wants to glue the two
>> together without coding in either project.
>> -However, most of the time it will be non-trivial to turn UIMA Feature
>> Structures into Stanbol Enhancements. In some cases I can imagine that
>> you can just turn every FS into a triple by a simple rule or something,
>> but making this flexible enough from some configuration files seems
>> rather unrealistic to me.
>>
>> So what I have in mind now about UIMA->Enhancement conversion is:
>> -defining a simple Java interface with one function, e.g.: Triple
>> convertFStoTriple(org.apache.uima.cas.FeatureStructure fs). By
>> implementing this one function the user could easily define how feature
>> structures are to be turned into Triples. Most of the time this function
>> would return null, as there are usually many more UIMA
>> FeatureStructures generated (e.g. about two for every word) than the
>> user wants to deal with.
>
> Don't forget the possibility to store the UIMA feature structure as a
> ContentPart in the Stanbol ContentItem. [1] I would suggest defining a fixed
> URI as key so that all UIMA related stuff knows how to search for it.
> With the multipart ContentItem RESTful API users could even request the UIMA 
> feature structure via the Stanbol RESTful API.

Okay, I see that the ContentPart interface is designed for holding
this kind of stuff. Perfect!

>> -creating an Enhancement Engine called UIMAAdapter. This would have a
>> converterClass Service Property that could be configured to contain
>> the name of the class the user just created. This would instantiate
>> the user-written class, provided that it's on the classpath, and use it
>> to create enhancements.
>
> In OSGI one would rather define an interface and register converters as 
> services. Services can be manually registered by using the BundleContext. An 
> alternative is to use "@Component" annotations - as in the case of 
> EnhancementEngines. In this case the OSGI config admin will automatically 
> create the component and register it as service.

Ok. What I had in mind is that there are at least two orders of magnitude
more people who can implement a Java interface than people who are
OSGI-savvy. But I guess, if we write good docs with example code,
everyone should be able to handle it.
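
As a starting point for that kind of example code, here is a minimal sketch of the one-function converter from my proposal. It is plain Java and leaves the OSGI registration out. FeatureStructureStub and TripleStub are hypothetical stand-ins for org.apache.uima.cas.FeatureStructure and the RDF triple type, so that the sketch compiles without UIMA or Stanbol; the type name, the URIs and the ConverterRunner helper are made up for illustration too.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.apache.uima.cas.FeatureStructure.
final class FeatureStructureStub {
    final String typeName;
    final String coveredText;

    FeatureStructureStub(String typeName, String coveredText) {
        this.typeName = typeName;
        this.coveredText = coveredText;
    }
}

// Hypothetical stand-in for an RDF triple.
final class TripleStub {
    final String subject;
    final String predicate;
    final String object;

    TripleStub(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// The one-function interface from the proposal (name is illustrative).
interface FsToTripleConverter {
    // Returns null for feature structures the user does not care about.
    TripleStub convertFStoTriple(FeatureStructureStub fs);
}

// Example user implementation: keep sentences, ignore everything else.
final class SentenceOnlyConverter implements FsToTripleConverter {
    @Override
    public TripleStub convertFStoTriple(FeatureStructureStub fs) {
        if (!"SentenceAnnotation".equals(fs.typeName)) {
            return null;
        }
        return new TripleStub("urn:example:content-item",
                "urn:example:hasSentence", fs.coveredText);
    }
}

// What the adapter engine would do: call the converter per FS, drop nulls.
final class ConverterRunner {
    static List<TripleStub> convertAll(List<FeatureStructureStub> input,
                                       FsToTripleConverter converter) {
        List<TripleStub> triples = new ArrayList<>();
        for (FeatureStructureStub fs : input) {
            TripleStub triple = converter.convertFStoTriple(fs);
            if (triple != null) {
                triples.add(triple);
            }
        }
        return triples;
    }
}
```

Whether the converter is then looked up by class name or registered as an OSGI service, the user-facing part would stay this small.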

>> -for more advanced cases we could provide an interface to map a
>> List<FeatureStructure> to List<Triples>. For even more advanced cases
>> we could provide a convert(List<FeatureStructure>,ContentItem ci)
>> function with full access to the Stanbol ContentItem
>> -naturally we could write some default converter that converts every
>> FeatureStructure that comes out of UIMA to triples in some generic way,
>> for testing purposes and as a basis for extension.
>
> I would suggest to separate two things:
>
> 1. an Engine that executes the UIMA Annotation Chain and stores the
> results as ContentPart in the Stanbol ContentItem
> 2. one or more Engines that convert the UIMA results to Stanbol Enhancements

I think this separation is a great idea.
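
To check that I understand the split, here is a rough sketch of the two stages. Everything in it is a simplified stand-in and not the real Stanbol API: ContentItemStub only holds text, a map of parts and a list of strings, the part key URI is an assumed placeholder, and stage 1 fakes the UIMA call with whitespace tokenization.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, much simplified stand-in for a Stanbol ContentItem.
final class ContentItemStub {
    final String text;
    final Map<String, Object> parts = new HashMap<>();
    final List<String> enhancements = new ArrayList<>();

    ContentItemStub(String text) {
        this.text = text;
    }
}

// Hypothetical stand-in for an enhancement engine.
interface EngineStub {
    void process(ContentItemStub ci);
}

final class UimaPartKeys {
    // Assumed fixed key; the real URI would have to be agreed on.
    static final String UIMA_RESULTS = "urn:example:uima:results";
}

// Stage 1: would run the UIMA chain; here faked by whitespace tokenization.
final class UimaRunnerEngine implements EngineStub {
    @Override
    public void process(ContentItemStub ci) {
        List<String> tokens = new ArrayList<>();
        for (String token : ci.text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        ci.parts.put(UimaPartKeys.UIMA_RESULTS, tokens);
    }
}

// Stage 2: reads the stored part and turns it into enhancements.
final class UimaToEnhancementEngine implements EngineStub {
    @Override
    public void process(ContentItemStub ci) {
        Object part = ci.parts.get(UimaPartKeys.UIMA_RESULTS);
        if (!(part instanceof List)) {
            return; // stage 1 did not run, nothing to convert
        }
        for (Object token : (List<?>) part) {
            ci.enhancements.add("token: " + token);
        }
    }
}

// Stand-in for a chain: runs the engines in the given order.
final class ChainStub {
    static void run(ContentItemStub ci, List<EngineStub> engines) {
        for (EngineStub engine : engines) {
            engine.process(ci);
        }
    }
}
```

The point of the sketch is that stage 2 depends only on the part key, so several converter engines could read the same stored UIMA result.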

> one possibility would be to use an EnhancementChain for chaining (1) and (2).
>
> I would also expect different implementations of (2)
>
> * Fixed implementations for typical things contained in UIMA results
> * Configurable implementations that require users to provide the mappings
> * Generic implementations that mainly convert the UIMA results to RDF: That
> RDF might be further processed by another Stanbol Engine.
> * Special implementations optimized for special use cases. Those would need 
> to be created by Stanbol users or UIMA annotationChain providers.

UIMA only defines the FeatureStructure data model; what goes into it
is up to the developers. SentenceAnnotation and
TokenAnnotation are still quite common though. So I will try to come
up with some configuration scheme for simpler mappings.
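
One very simple form such a scheme could take is a properties-style file that maps UIMA type names to RDF type URIs, with unmapped types being skipped. The sketch below is only an assumption about how this might look; the TypeMapping class, the short type names and the URIs are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical mapping from UIMA type names to RDF type URIs.
final class TypeMapping {
    private final Map<String, String> rdfTypeByUimaType = new LinkedHashMap<>();

    // Loads "uimaTypeName = rdfTypeUri" pairs in java.util.Properties syntax.
    static TypeMapping parse(String config) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        TypeMapping mapping = new TypeMapping();
        for (String name : props.stringPropertyNames()) {
            mapping.rdfTypeByUimaType.put(name, props.getProperty(name).trim());
        }
        return mapping;
    }

    // Returns the RDF type URI, or null when the UIMA type is not mapped.
    String rdfTypeFor(String uimaTypeName) {
        return rdfTypeByUimaType.get(uimaTypeName);
    }
}
```

A config for the two common types could then be as short as:

    SentenceAnnotation = urn:example:Sentence
    TokenAnnotation = urn:example:Token

Anything beyond a type-to-type mapping (features, nested structures) would probably still need a hand-written converter.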

> However, as my knowledge about typical UIMA results is very limited, this
> might also not be feasible.
>
>>
>> The other question is how to communicate with the UIMA Engine. I think
>> the feature of accessing a remotely deployed UIMA engine is a must and
>> the REST interface you can try out on the link above (provided by
>> UIMASimpleServlet) is good for starters. I'm much less sure that
>> embedding everything into a Stanbol Enhancement Engine that is needed
>> to run a UIMA engine is such a good idea, but I think it can be done.
>>
>
> There is already an integration of Apache Clerezza with UIMA. Maybe we can
> build upon this, and even if we cannot, this should provide valuable input on
> how to use UIMA from an OSGI based framework.

I will look into it. BTW, my main concern is that working with UIMA
involves generating and compiling Java source code from scripts
that are configured with XML files. Like CORBA and early SOAP
stuff, UIMA is older than Java 5, where generics were introduced, which
is why it works this way. Now, if you have a UIMA module off the shelf,
you don't have to do the compiling part, but you still have to have the
generated classes on the classpath to be able to consume the annotations
the easy way. This prevents us from just writing an adapter once and using
it with every UIMA AE the way it is intended. Of course, this can be
circumvented -- if you have the type system XML you can repackage any
incoming object into a generic FeatureStructure class (or serialize to
XML, like the UIMASimpleServer does). Maybe there is nothing wrong
with that, but it still feels a bit hairy to me. I will try and see
if it can be done nicely.
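
To illustrate the XML route, here is a sketch that reads annotations into a generic type-plus-features structure without any generated classes. The XML layout it expects (one child element per annotation, features as attributes) is an assumption made up for this example and not necessarily what the UIMA server emits; GenericAnnotation and GenericXmlReader are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Generic holder: a type name plus feature name/value pairs.
final class GenericAnnotation {
    final String type;
    final Map<String, String> features;

    GenericAnnotation(String type, Map<String, String> features) {
        this.type = type;
        this.features = features;
    }
}

final class GenericXmlReader {
    // Assumed layout: <result><SomeType feature="value" .../>...</result>
    static List<GenericAnnotation> read(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<GenericAnnotation> result = new ArrayList<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and other non-element nodes
                }
                Map<String, String> features = new LinkedHashMap<>();
                NamedNodeMap attributes = child.getAttributes();
                for (int j = 0; j < attributes.getLength(); j++) {
                    Node attribute = attributes.item(j);
                    features.put(attribute.getNodeName(), attribute.getNodeValue());
                }
                result.add(new GenericAnnotation(child.getNodeName(), features));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse annotation XML", e);
        }
    }
}
```

With something like this the adapter only sees strings, which loses the typed accessors but would work with any AE whose output can be serialized this way.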

>> What do you think of all the above?
>>
>> p.s. Do you have a "How to write and deploy a Hello World Enhancement
>> Engine tutorial"? I have found the description of the functions to
>> implement, but still it took me a while to figure out how to deploy it
>> to Felix, etc. If not, I can write one for you based on my notes.
>
> That would be a valuable addition to the Documentation of the Stanbol 
> Enhancer.
>
> BTW: Contributing to the Webpage is easy:
>
> * svn co http://svn.apache.org/repos/asf/incubator/stanbol/site/trunk/ 
> stanbol-website
> * create new content using  Markdown Syntax
> * open a JIRA issue and provide your contribution as a patch

Ok, will do at some point!

> best
> Rupert
>
>>
>> Best,
>> Mihály
>
