Re: Indexing nouns only with UIMA works - performance issue?

Tommaso Teofili Mon, 04 Feb 2013 01:54:36 -0800

Thanks Kai for your feedback, I'll look into it and let you know.
Regards,
Tommaso



2013/2/1 Kai Gülzau <kguel...@novomind.com>

> I now use the "stupid" way to use the German corpus for UIMA: copy + paste
> :-)
>
> I modified the Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus
> ...
> <fileResourceSpecifier>
>   <fileUrl>file:german/TuebaModel.dat</fileUrl>
> </fileResourceSpecifier>
> ...
> and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml
>
>
> Next step is to replace every occurrence of "HmmTagger" in
> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
> with "HmmTaggerDE" and save it as
> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml
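
The copy + rename step above can also be scripted instead of done by hand. What follows is only a minimal sketch, assuming the descriptor is UTF-8 text and that you work on a copy of the jar; the helper names rename_references and derive_descriptor are made up for illustration and are not part of Solr, Lucene or UIMA.

```python
import re
import zipfile


def rename_references(xml_text, old="HmmTagger", new="HmmTaggerDE"):
    """Replace whole-word occurrences of `old`; text that already says `new` is left alone."""
    return re.sub(r"\b" + re.escape(old) + r"\b", new, xml_text)


def derive_descriptor(jar_path, src_entry, dst_entry,
                      old="HmmTagger", new="HmmTaggerDE"):
    """Add `dst_entry` to the jar as a copy of `src_entry` with references renamed.

    Hypothetical helper: assumes the descriptor is UTF-8 encoded.
    """
    # Read the original descriptor first; the source entry itself is not modified.
    with zipfile.ZipFile(jar_path, "r") as jar:
        if dst_entry in jar.namelist():
            raise ValueError(dst_entry + " already exists in " + jar_path)
        xml_text = jar.read(src_entry).decode("utf-8")
    # Re-open in append mode to add the derived descriptor as a new entry.
    with zipfile.ZipFile(jar_path, "a") as jar:
        jar.writestr(dst_entry, rename_references(xml_text, old, new))
```

For the case described above the call would be derive_descriptor("lucene-analyzers-uima-4.1.0.jar", "uima/AggregateSentenceAE.xml", "uima/AggregateSentenceDEAE.xml"). The whole-word match is there so that running the replacement on text that already contains "HmmTaggerDE" does not produce "HmmTaggerDEDE".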
>
> This can be used in your schema.xml:
> <fieldType name="uima_nouns_de" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>                descriptorPath="/uima/AggregateSentenceDEAE.xml"
>                tokenType="org.apache.uima.TokenAnnotation"
>                featurePath="posTag"/>
>     <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
>             types="/uima/whitelist_de.txt"/>
>   </analyzer>
> </fieldType>
>
> There should be a way to accomplish this via config though.
>
>
>
> Last open issue: Performance!
>
> First run via the Admin GUI analysis page (index value: "Klaus geht in das
> Haus und sieht eine Maus.", query: ""): ~5 seconds
> Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information:
> "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer finished processing"
> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information:
> "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer finished processing"
> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information:
> "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer finished processing"
>
> Second run via the Admin GUI analysis page (index value: "Klaus geht in das
> Haus und sieht eine Maus.", query: ""): ~4 seconds
> Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize Information:
> "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer finished processing"
> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize Information:
> "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer finished processing"
> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize Information:
> "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>  Information: "Whitespace tokenizer finished processing"
>
> Initialized 3 times per run?
> I think some of the components are not reused while analyzing.
>
> Is this a known issue?
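
To put numbers on the log above, here is a rough sketch that counts the initialize calls of the first run and measures how long each one takes until typeSystemInit. The header lines are copied from the log with the message text left out; the function names parse and summarize are made up for this example. It assumes an English locale for the month and AM/PM parsing, and the timestamps only have one-second resolution, so the durations are approximate.

```python
import re
from datetime import datetime

# Header lines of the first run, copied from the log above (message text omitted).
LOG = """\
Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
"""

# Timestamp, logger name, method name.
HEADER = re.compile(r"^(\w{3} \d{2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) (\S+) (\S+)")


def parse(log):
    """Return (timestamp, logger, method) for every header line; other lines are skipped."""
    events = []
    for line in log.splitlines():
        match = HEADER.match(line.lstrip("> "))  # tolerate mail quoting
        if match:
            ts = datetime.strptime(match.group(1), "%b %d, %Y %I:%M:%S %p")
            events.append((ts, match.group(2), match.group(3)))
    return events


def summarize(events):
    """Return (number of initialize calls, seconds per initialization, total seconds)."""
    inits = [ts for ts, _, method in events if method == "initialize"]
    ready = [ts for ts, _, method in events if method == "typeSystemInit"]
    init_seconds = [(r - i).total_seconds() for i, r in zip(inits, ready)]
    total = (events[-1][0] - events[0][0]).total_seconds()
    return len(inits), init_seconds, total


print(summarize(parse(LOG)))  # (3, [2.0, 1.0, 2.0], 5.0)
```

For the first run this gives 3 initializations taking roughly 2 s, 1 s and 2 s, which adds up to the whole ~5 seconds; the process calls themselves start and finish within the same second.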
>
>
> Regards,
>
> Kai Gülzau
>
>
>
> -----Original Message-----
> From: Kai Gülzau [mailto:kguel...@novomind.com]
> Sent: Thursday, January 31, 2013 6:48 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing nouns only - UIMA vs. OpenNLP
>
> UIMA:
>
> I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
> Now I am able to use this analyzer for English texts and filter (un)wanted
> token types :-)
>
> <fieldType name="uima_nouns_en" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>                descriptorPath="/uima/AggregateSentenceAE.xml"
>                tokenType="org.apache.uima.TokenAnnotation"
>                featurePath="posTag"/>
>     <filter class="solr.TypeTokenFilterFactory"
>             types="/uima/stoptypes.txt"/>
>   </analyzer>
> </fieldType>
>
> Open issue -> How to set the ModelFile for the Tagger to
> "german/TuebaModel.dat" ???
>
>
> Kai Gülzau
>
>
