Re: Creating custom Filter / Tokenizer / Request Handler for integration of NER-Framework

2012-05-31 Thread Wunderlich, Tobias
Thanks for all the responses. I went with the UpdateRequestProcessor and it 
works.
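
For readers of the archive, the update-processor idea can be sketched roughly as follows. This is a dependency-free illustration, not the code used in this thread and not Solr API: the class NerUpdateSketch, the methods enrich and toyAnnotate, and the field names "content" and "entities" are all assumptions made for the example. In a real setup the same logic would presumably sit in the processAdd method of a custom UpdateRequestProcessor and call the actual NER framework instead of the toy annotator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Dependency-free sketch; none of these types are Solr classes.
final class NerUpdateSketch {

    // Hypothetical stand-in for the NER framework:
    // every distinct capitalized word is treated as an annotation.
    static List<String> toyAnnotate(String content) {
        List<String> found = new ArrayList<>();
        for (String word : content.split("\\W+")) {
            if (!word.isEmpty()
                    && Character.isUpperCase(word.charAt(0))
                    && !found.contains(word)) {
                found.add(word);
            }
        }
        return found;
    }

    // Reads the whole value(s) of sourceField, runs the annotator on each,
    // and appends every annotation to the multivalued targetField.
    // Works on a copy, so the input document is left untouched.
    static Map<String, List<Object>> enrich(Map<String, List<Object>> doc,
                                            String sourceField,
                                            String targetField,
                                            Function<String, List<String>> annotator) {
        Map<String, List<Object>> out = new HashMap<>();
        for (Map.Entry<String, List<Object>> e : doc.entrySet()) {
            out.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        List<Object> values = out.get(sourceField);
        if (values == null) {
            return out; // nothing to annotate
        }
        // Collect first, then add, so source and target may be the same field.
        List<Object> annotations = new ArrayList<>();
        for (Object value : values) {
            annotations.addAll(annotator.apply(String.valueOf(value)));
        }
        out.computeIfAbsent(targetField, k -> new ArrayList<>()).addAll(annotations);
        return out;
    }
}
```

The point of doing this before analysis is that the annotator sees the complete field value rather than individual tokens, and the result ends up in an ordinary multivalued field that can be used for faceting.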


-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: Saturday, 26 May 2012 01:53
To: solr-user@lucene.apache.org
Subject: Re: Creating custom Filter / Tokenizer / Request Handler for 
integration of NER-Framework

Another problem (just discovered this): TokenizerFactories do not get resource 
handlers. So, you can't go read config or model files for your Tokenizer. 
TokenFilters do, so you can use the KeywordTokenizer (make one big term) and do 
your work in a TokenFilter that gets the whole thing.
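
A rough illustration of that "one big term, then filter" idea, written without any Lucene or Solr classes: WholeTextFilterSketch, keywordTokenize and annotationFilter are hypothetical names, and the capitalized-word rule is only a placeholder for a real NER call. It is meant to show the data flow, not the TokenStream API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Dependency-free sketch of the data flow; not Lucene's TokenStream API.
final class WholeTextFilterSketch {

    // Mimics the effect of a keyword-style tokenizer:
    // the entire input is passed on as a single token.
    static List<String> keywordTokenize(String text) {
        return Collections.singletonList(text);
    }

    // Filter stage: receives the single whole-text token and emits
    // one output token per annotation found in it.
    // Placeholder rule: capitalized words count as annotations.
    static List<String> annotationFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String word : token.split("\\W+")) {
                if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                    out.add(word);
                }
            }
        }
        return out;
    }
}
```

Calling annotationFilter(keywordTokenize(text)) hands the filter the full text in one piece, which is what an NER step needs.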

On Thu, May 24, 2012 at 7:33 AM, Jan Høydahl wrote:
> As Ahmet says, The Update Chain is probably the place to integrate such 
> document oriented processing.
> See http://www.cominvent.com/2011/04/04/solr-architecture-diagram/ for how it 
> integrates with Solr.
>
> --
> Jan Høydahl, search solution architect Cominvent AS - 
> www.facebook.com/Cominvent Solr Training - www.solrtraining.com
>
> On 24 May 2012, at 14:04, Wunderlich, Tobias wrote:
>
>> Hey Guys,
>>
>> I have recently been working on a project to integrate a 
>> Named-Entity-Recognition (NER) framework into an existing search platform 
>> based on Solr. The platform uses ManifoldCF (MCF) to automatically gather 
>> the content from various repositories. The NER framework creates 
>> annotations/metadata from given content, which I then want to integrate 
>> into the search platform as metadata to use for faceting. Since MCF handles 
>> all content gathering, I need a way to integrate the NER framework directly 
>> into Solr. The goal is to get all annotations per document into a 
>> multivalued field. My first thought was to create a custom filter that just 
>> takes the content and gives back only the annotations. But as I understand 
>> it, a filter only processes predetermined tokens, which is useless for my 
>> purpose, since the NER framework needs to process the whole content of a 
>> document. What about a custom Tokenizer? Would it be possible to process 
>> the whole text and give back only the annotations as tokens? A third 
>> thought was to modify the ExtractingRequestHandler (Solr Cell) used by MCF 
>> to somehow add the annotations as metadata when the content and metadata 
>> are distributed to the different fields.
>>
>> I hope my problem description is sufficient. Does anybody have any thoughts 
>> on that subject?
>>
>> Best regards,
>> Tobias
>



--
Lance Norskog
goks...@gmail.com


Creating custom Filter / Tokenizer / Request Handler for integration of NER-Framework

2012-05-24 Thread Wunderlich, Tobias
Hey Guys,

I have recently been working on a project to integrate a 
Named-Entity-Recognition (NER) framework into an existing search platform based 
on Solr. The platform uses ManifoldCF (MCF) to automatically gather the content 
from various repositories. The NER framework creates annotations/metadata from 
given content, which I then want to integrate into the search platform as 
metadata to use for faceting. Since MCF handles all content gathering, I need a 
way to integrate the NER framework directly into Solr. The goal is to get all 
annotations per document into a multivalued field. My first thought was to 
create a custom filter that just takes the content and gives back only the 
annotations. But as I understand it, a filter only processes predetermined 
tokens, which is useless for my purpose, since the NER framework needs to 
process the whole content of a document. What about a custom Tokenizer? Would 
it be possible to process the whole text and give back only the annotations as 
tokens? A third thought was to modify the ExtractingRequestHandler (Solr Cell) 
used by MCF to somehow add the annotations as metadata when the content and 
metadata are distributed to the different fields.

I hope my problem description is sufficient. Does anybody have any thoughts on 
that subject?

Best regards,
Tobias

