Re: couple of Sandbox components

Tommaso Teofili Sat, 27 Nov 2010 04:59:14 -0800

2010/11/26 Jörn Kottmann <kottm...@gmail.com>

> On 11/26/10 9:35 AM, Tommaso Teofili wrote:
>
>> Hi all,
>> following Burn's proposal for a multimodal analysis component skeleton, I
>> also have a couple of components to propose for inclusion inside the sandbox:
>>
>>    - Solr CAS Consumer - to consume CAS/types/features inside Solr fields.
>>    This could be put inside Lucas or in a separate project
>>
>
> As far as I know, the main difference from a configuration point of view
> is that Lucas defines the language analyzers inside the AE configuration,
> while Solr defines them in a server-side XML configuration file.
> In the end there might not be much that could be reused from Lucas.
>


I only thought of Lucas because Lucene and Solr are so close that, at a high
level, it could make sense to have them inside the same component, but I
agree that at a functional level they are quite different.


>
> Lucas is not maintained right now, and I guess that is because most
> people are not interested in creating a Lucene index from a bunch of
> documents.
>

I have heard of someone using it. If I can find the time, I will try to
maintain it and update it to the latest Lucene (or maybe to 2.9.3, which is
backward compatible but still has some of the major 3.x improvements).


>
> The way we use UIMA is to process a stream of documents which are
> received continuously. In this model a Solr AE fits really nicely, because
> it just sends each received document to a Solr server, which adds it
> to the index. After a document is received it can be searched with a
> short delay. With Lucas that would not be possible.
>
> I actually created a small Solr AE for doing a quick semantic search demo.
> One problem I ran into is that the Solr AE really slows down my
> processing pipeline. Anyway, I would be happy to test your implementation
> and contribute to it.
>

Nice! Thanks, looking forward to cooperating on it.


>
>>    - a Simple Language Annotator - to extract language from document text,
>>    this one can use 3 algorithms:
>>       - Tika 0.8 language identification capability
>>       - Alchemy language annotator
>>       - Dictionaries of stopwords for each language
>>
>
> We could easily add AEs which set the language to the Tika and
> Alchemy projects we already have. It can also be done with OpenNLP.
>

It's a good idea, so the third algorithm could be put inside the
DictionaryAnnotator.
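
As a rough illustration of that third algorithm (stopword dictionaries), here
is a minimal Python sketch. The word lists, the `guess_language` name and the
"most stopword hits wins" scoring rule are all assumptions made up for this
example; it is not the annotator's actual code and does not use the UIMA API.

```python
import re
from collections import Counter

# Tiny illustrative stopword lists (assumption); a real annotator would
# load much fuller dictionaries, one per supported language.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "it": {"il", "la", "che", "di", "e", "un", "per", "non"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
}


def guess_language(text, stopwords=STOPWORDS):
    """Return the language code with the most stopword hits, or None.

    Ties are resolved in favour of the language listed first in the
    dictionary; None is returned when no stopword matches at all.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for lang, words in stopwords.items():
        # Count every token occurrence that is a stopword of this language.
        counts[lang] = sum(1 for token in tokens if token in words)
    if not counts:
        return None
    best, hits = counts.most_common(1)[0]
    return best if hits > 0 else None
```

Calling `guess_language(document_text)` would give the value to set as the
document language; short texts or texts mixing languages would likely need a
minimum-hits threshold on top of this.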
Cheers,
Tommaso



>
> Jörn
>
