Indexing nouns only with UIMA works - performance issue?

Kai Gülzau Fri, 01 Feb 2013 02:39:17 -0800

I now use the "stupid" way to get the German corpus into UIMA: copy + paste :-)


I modified Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus:
...
<fileResourceSpecifier>
  <fileUrl>file:german/TuebaModel.dat</fileUrl>
</fileResourceSpecifier>
...
and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml


Next step is to replace every occurrence of "HmmTagger" in
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
with "HmmTaggerDE" and save it as
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml
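
The copy + paste steps above could also be scripted. What follows is only a minimal sketch, not an official tool: the helper name add_patched_copy is made up for illustration, it assumes the descriptors are UTF-8 text and that the jar is writable and not signed, and it should be run only once per target entry (a second run would append a duplicate entry).

```python
import zipfile


def add_patched_copy(jar_path, src_entry, dst_entry, replacements):
    """Read src_entry from the jar, apply plain text replacements,
    and append the result to the same jar under dst_entry.

    Hypothetical helper; assumes the descriptor is UTF-8 encoded.
    """
    with zipfile.ZipFile(jar_path, "a") as jar:
        text = jar.read(src_entry).decode("utf-8")
        for old, new in replacements.items():
            text = text.replace(old, new)
        jar.writestr(dst_entry, text)
    return text


# Usage for the second step described in the mail (paths as given there):
# add_patched_copy(
#     "lucene-analyzers-uima-4.1.0.jar",
#     "uima/AggregateSentenceAE.xml",
#     "uima/AggregateSentenceDEAE.xml",
#     {"HmmTagger": "HmmTaggerDE"},
# )
```

The same helper would cover the HmmTaggerDE.xml step if the replacements map the existing fileUrl value to file:german/TuebaModel.dat.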

This can be used in your schema.xml:
<fieldType name="uima_nouns_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceDEAE.xml"
      tokenType="org.apache.uima.TokenAnnotation"
      featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
      types="/uima/whitelist_de.txt"/>
  </analyzer>
</fieldType>
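
To illustrate what the whitelist filter in this field type does, here is a small sketch outside of Solr. It rests on several assumptions: the whitelist contents (NN, NE) are hypothetical, since the real whitelist_de.txt is not shown here, and the tags on the sample sentence are hand-assigned STTS-style tags, not actual tagger output. The tags your model emits are best checked on the Admin GUI analysis page.

```python
# Hand-assigned STTS-style tags for the sample sentence.
# This is an assumption for illustration, not output of the HMM tagger.
TAGGED = [
    ("Klaus", "NE"), ("geht", "VVFIN"), ("in", "APPR"), ("das", "ART"),
    ("Haus", "NN"), ("und", "KON"), ("sieht", "VVFIN"), ("eine", "ART"),
    ("Maus", "NN"), (".", "$."),
]

# Hypothetical contents of whitelist_de.txt: one token type per line.
WHITELIST_TXT = "NN\nNE\n"


def load_types(text):
    """Parse a types file into a set, skipping blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def keep_whitelisted(tagged_tokens, types):
    """Keep only tokens whose type is whitelisted (the useWhitelist="true" case)."""
    return [token for token, tag in tagged_tokens if tag in types]


print(keep_whitelisted(TAGGED, load_types(WHITELIST_TXT)))
# prints ['Klaus', 'Haus', 'Maus']
```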

There should be a way to accomplish this via config though.



Last open issue: Performance!

First run via the Admin GUI analysis page (index value "Klaus geht in das Haus und sieht eine Maus.", query ""): ~5 seconds
Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit     Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit     Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit     Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer finished processing"

Second run via the Admin GUI analysis page (index value "Klaus geht in das Haus und sieht eine Maus.", query ""): ~4 seconds
Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit     Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit     Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit     Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process            Information: "Whitespace tokenizer finished processing"

The WhitespaceTokenizer is initialized 3 times for a single analysis run?
I think some of the components are not reused while analyzing.

Is this a known issue?
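
For anyone who wants to check where the time goes, the timestamps of the first run can be tallied with a few lines. This is only a rough sketch: the function name init_gaps is made up, the log has one-second resolution, and it simply measures the gap between each "initialize" entry and the following "typeSystemInit" entry. Only the first line of each log entry is needed.

```python
import re
from datetime import datetime

# First line of each log entry from the first run, as posted above.
FIRST_RUN = """\
Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information:
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit     Information:
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process            Information:
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process            Information:
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information:
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit     Information:
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process            Information:
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process            Information:
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information:
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit     Information:
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process            Information:
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process            Information:
"""

ENTRY = re.compile(
    r"^(\w{3} \d{2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) WhitespaceTokenizer (\w+)"
)


def init_gaps(log_text):
    """Return the seconds between each 'initialize' entry and the
    next 'typeSystemInit' entry, in log order."""
    gaps = []
    started = None
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%b %d, %Y %I:%M:%S %p")
        if match.group(2) == "initialize":
            started = when
        elif match.group(2) == "typeSystemInit" and started is not None:
            gaps.append(int((when - started).total_seconds()))
            started = None
    return gaps


gaps = init_gaps(FIRST_RUN)
print(len(gaps), gaps, sum(gaps))
# prints 3 [2, 1, 2] 5
```

So in the first run the three initializations account for essentially all of the ~5 seconds; the process calls themselves finish within the same second.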


Regards,

Kai Gülzau



-----Original Message-----
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, January 31, 2013 6:48 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing nouns only - UIMA vs. OpenNLP

UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted token types :-)

<fieldType name="uima_nouns_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceAE.xml"
      tokenType="org.apache.uima.TokenAnnotation"
      featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt"/>
  </analyzer>
</fieldType>

Open issue -> How to set the ModelFile for the Tagger to "german/TuebaModel.dat"???


Kai Gülzau
