Re: Indexing nouns only with UIMA works - performance issue?

Tommaso Teofili Mon, 04 Feb 2013 05:23:18 -0800

Regarding configuration parameters, have a look at
https://issues.apache.org/jira/browse/LUCENE-4749
Regards,
Tommaso


2013/2/4 Tommaso Teofili <tommaso.teof...@gmail.com>

> Thanks Kai for your feedback, I'll look into it and let you know.
> Regards,
> Tommaso
>
>
> 2013/2/1 Kai Gülzau <kguel...@novomind.com>
>
>> I now use the "stupid" way to use the German corpus for UIMA: copy +
>> paste :-)
>>
>> I modified the Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus
>> ...
>> <fileResourceSpecifier>
>>   <fileUrl>file:german/TuebaModel.dat</fileUrl>
>> </fileResourceSpecifier>
>> ...
>> and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml
>>
>>
>> Next step is to replace every occurrence of "HmmTagger" in
>> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
>> with "HmmTaggerDE" and save it as
>> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml
>>
>> This can be used in your schema.xml:
>> <fieldType name="uima_nouns_de" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>>                descriptorPath="/uima/AggregateSentenceDEAE.xml"
>>                tokenType="org.apache.uima.TokenAnnotation"
>>                featurePath="posTag"/>
>>     <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
>>             types="/uima/whitelist_de.txt"/>
>>   </analyzer>
>> </fieldType>
>>
>> There should be a way to accomplish this via config though.
>>
>>
>>
>> Last open issue: Performance!
>>
>> First run via the Admin GUI analysis page (index value: "Klaus geht in das
>> Haus und sieht eine Maus.", query: ""): ~5 seconds
>> Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information:
>> "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer finished processing"
>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information:
>> "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer finished processing"
>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information:
>> "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer finished processing"
>>
>> Second run via the Admin GUI analysis page (index value: "Klaus geht in das
>> Haus und sieht eine Maus.", query: ""): ~4 seconds
>> Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize Information:
>> "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer finished processing"
>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize Information:
>> "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer finished processing"
>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize Information:
>> "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>>  Information: "Whitespace tokenizer finished processing"
>>
>> The WhitespaceTokenizer is initialized 3 times per run?
>> I think some of the components are not reused while analyzing.
>>
>> Is this a known issue?
>>
>>
>> Regards,
>>
>> Kai Gülzau
>>
>>
>>
>> -----Original Message-----
>> From: Kai Gülzau [mailto:kguel...@novomind.com]
>> Sent: Thursday, January 31, 2013 6:48 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Indexing nouns only - UIMA vs. OpenNLP
>>
>> UIMA:
>>
>> I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
>> Now I am able to use this analyzer for English texts and filter
>> (un)wanted token types :-)
>>
>> <fieldType name="uima_nouns_en" class="solr.TextField"
>>            positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>>                descriptorPath="/uima/AggregateSentenceAE.xml"
>>                tokenType="org.apache.uima.TokenAnnotation"
>>                featurePath="posTag"/>
>>     <filter class="solr.TypeTokenFilterFactory"
>>             types="/uima/stoptypes.txt"/>
>>   </analyzer>
>> </fieldType>
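>>
>> A minimal sketch of how such a field type might be referenced from the
>> rest of schema.xml. The field names "text" and "nouns_en" and the
>> "text_general" type are assumptions for illustration only; the one name
>> taken from the definition above is "uima_nouns_en":
>>
>> ```xml
>> <!-- Hypothetical source field holding the full text (assumed name and type). -->
>> <field name="text" type="text_general" indexed="true" stored="true"/>
>> <!-- Hypothetical target field analyzed with the UIMA-based field type. -->
>> <field name="nouns_en" type="uima_nouns_en" indexed="true" stored="false"/>
>> <!-- Copy the raw text so the UIMA analyzer chain runs on it at index time. -->
>> <copyField source="text" dest="nouns_en"/>
>> ```
>>
>> With a layout like this, queries against "nouns_en" would only match the
>> token types that pass the TypeTokenFilterFactory.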
>>
>> Open issue -> How to set the ModelFile for the Tagger to
>> "german/TuebaModel.dat"?
>>
>>
>> Kai Gülzau
>>
>>
>
