I am now using the German corpus with UIMA the "stupid" way: copy + paste :-)

I modified Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus
and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml
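
A minimal sketch of that copy step in Python, assuming the jar may be modified in place and the descriptor is plain UTF-8 XML. The function name add_descriptor_variant and the old_model argument are hypothetical; the path of the default model has to be read from the shipped HmmTagger.xml, it is not stated in this mail.

```python
import zipfile

def add_descriptor_variant(jar_path, source_entry, target_entry, old_model, new_model):
    """Append a copy of source_entry to the jar as target_entry,
    with every occurrence of old_model replaced by new_model."""
    with zipfile.ZipFile(jar_path, "a") as jar:
        if target_entry in jar.namelist():
            raise ValueError(f"{target_entry} already exists in {jar_path}")
        descriptor = jar.read(source_entry).decode("utf-8")
        if old_model not in descriptor:
            # Fail loudly instead of silently writing an unchanged copy.
            raise ValueError(f"{old_model!r} not found in {source_entry}")
        jar.writestr(target_entry, descriptor.replace(old_model, new_model))
```

It could be called as add_descriptor_variant("Tagger-2.3.1.jar", "HmmTagger.xml", "HmmTaggerDE.xml", old_model, "german/TuebaModel.dat"), where old_model is whatever model path the original descriptor references. Keeping a backup of the jar first is probably wise.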

The next step is to replace every occurrence of "HmmTagger" in
with "HmmTaggerDE" and save it as
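
The search-and-replace itself could be scripted roughly like this. It is only a sketch: rename_component is a hypothetical helper, and the word-boundary match is my own choice so that running it twice does not turn "HmmTaggerDE" into "HmmTaggerDEDE".

```python
import re

def rename_component(descriptor_text, old_name="HmmTagger", new_name="HmmTaggerDE"):
    """Replace whole-word occurrences of old_name in a descriptor's text."""
    # \b keeps "HmmTaggerDE" untouched, because the "D" after
    # "HmmTagger" is a word character and so there is no boundary.
    pattern = re.compile(r"\b" + re.escape(old_name) + r"\b")
    return pattern.sub(new_name, descriptor_text)

print(rename_component('<import location="HmmTagger.xml"/>'))
# prints: <import location="HmmTaggerDE.xml"/>
```

Read the descriptor into a string, pass it through rename_component, and write the result under the new file name.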

This can be used in your schema.xml:
<fieldType name="uima_nouns_de" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
               tokenType="org.apache.uima.TokenAnnotation" featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
            types="/uima/whitelist_de.txt"/>
  </analyzer>
</fieldType>
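
To illustrate what the whitelist filter is meant to do, here is a small Python simulation. The POS tags are hand-assigned STTS-style labels for illustration only, not actual tagger output, and the whitelist contents (NN, NE) are an assumption about what /uima/whitelist_de.txt would contain.

```python
def keep_whitelisted(tagged_tokens, whitelist):
    """Keep only tokens whose type (here: the POS tag) is on the whitelist,
    mimicking TypeTokenFilterFactory with useWhitelist="true"."""
    return [text for text, pos_tag in tagged_tokens if pos_tag in whitelist]

# Hand-tagged example sentence (tags are assumed, see above).
tagged = [
    ("Klaus", "NE"), ("geht", "VVFIN"), ("in", "APPR"), ("das", "ART"),
    ("Haus", "NN"), ("und", "KON"), ("sieht", "VVFIN"), ("eine", "ART"),
    ("Maus", "NN"), (".", "$."),
]

print(keep_whitelisted(tagged, {"NN", "NE"}))
# prints: ['Klaus', 'Haus', 'Maus']
```

Without useWhitelist="true" the same types file acts as a stop list, so the listed types would be dropped instead of kept.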

There should be a way to accomplish this via config though.

Last open issue: Performance!

First run via the Admin GUI analysis page, index value "Klaus geht in das Haus
und sieht eine Maus." / query: "": ~5 seconds
Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information: 
"Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit     Information: 
"Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information: 
"Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit     Information: 
"Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information: 
"Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit     Information: 
"Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer finished processing"

Second run via the Admin GUI analysis page, index value "Klaus geht in das Haus
und sieht eine Maus." / query: "": ~4 seconds
Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize Information: 
"Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit     Information: 
"Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize Information: 
"Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit     Information: 
"Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize Information: 
"Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit     Information: 
"Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process            Information: 
"Whitespace tokenizer finished processing"

The tokenizer is initialized 3 times per analysis request?
I suspect some of the components are not reused during analysis.

Is this a known issue?
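
To put numbers on the log excerpts above, a rough Python sketch that counts the "initialize" events and measures the time between the first and last log entry. It assumes the line layout shown in this mail (timestamp, component, method, then "Information:"); the name summarize is made up for this example, and parsing the month name assumes an English/C locale.

```python
import re
from datetime import datetime

# Matches header lines such as:
# Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information:
LOG_LINE = re.compile(
    r"^(\w{3} \d{2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) (\S+) (\S+)\s+Information:"
)

def summarize(log_text):
    """Return (number of initialize events, seconds from first to last entry)."""
    events = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:  # the quoted message lines do not match and are skipped
            stamp = datetime.strptime(match.group(1), "%b %d, %Y %I:%M:%S %p")
            events.append((stamp, match.group(3)))
    inits = sum(1 for _, method in events if method == "initialize")
    elapsed = (events[-1][0] - events[0][0]).total_seconds() if events else 0.0
    return inits, elapsed

SAMPLE = """\
Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information:
"Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit     Information:
"Whitespace tokenizer typesystem initialized"
"""

print(summarize(SAMPLE))
# prints: (1, 2.0)
```

Fed the complete first-run excerpt, it reports 3 initialize events over 5 seconds, which matches the timing measured in the GUI.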


Kai Gülzau

-----Original Message-----
From: Kai Gülzau [mailto:kguel...@novomind.com] 
Sent: Thursday, January 31, 2013 6:48 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing nouns only - UIMA vs. OpenNLP


I just found this issue: https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted
token types :-)

<fieldType name="uima_nouns_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt"/>
  </analyzer>
</fieldType>

Open issue: how do I set the ModelFile for the Tagger to
"german/TuebaModel.dat"?

Kai Gülzau
