I now use the "stupid" way to get the German corpus into UIMA: copy + paste :-)

I modified Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus

  ...
  <fileResourceSpecifier>
    <fileUrl>file:german/TuebaModel.dat</fileUrl>
  </fileResourceSpecifier>
  ...

and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml

The next step is to replace every occurrence of "HmmTagger" in
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
with "HmmTaggerDE" and save it as
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml

This can be used in your schema.xml:

  <fieldType name="uima_nouns_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
                 descriptorPath="/uima/AggregateSentenceDEAE.xml"
                 tokenType="org.apache.uima.TokenAnnotation"
                 featurePath="posTag"/>
      <filter class="solr.TypeTokenFilterFactory" useWhitelist="true" types="/uima/whitelist_de.txt" />
    </analyzer>
  </fieldType>

There should be a way to accomplish this via config, though.

Last open issue: Performance!

First run via Admin GUI analyze, index value "Klaus geht in das Haus und sieht eine Maus."
/ query: "": ~ 5 seconds

Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize
Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit
Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize
Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit
Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize
Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit
Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer finished processing"

Second run via Admin GUI analyze "Klaus geht in das Haus und sieht eine Maus."
/ query: "": ~ 4 seconds

Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize
Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit
Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize
Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit
Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize
Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit
Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
Information: "Whitespace tokenizer finished processing"

Initialized 3 times?
I think some of the components are not reused while analyzing.
Is this a known issue?

Regards,

Kai Gülzau

-----Original Message-----
From: Kai Gülzau [mailto:kguel...@novomind.com]
Sent: Thursday, January 31, 2013 6:48 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing nouns only - UIMA vs. OpenNLP

UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted token types :-)

  <fieldType name="uima_nouns_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
                 descriptorPath="/uima/AggregateSentenceAE.xml"
                 tokenType="org.apache.uima.TokenAnnotation"
                 featurePath="posTag"/>
      <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt" />
    </analyzer>
  </fieldType>

Open issue -> How to set the ModelFile for the Tagger to "german/TuebaModel.dat" ???

Kai Gülzau
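
For reference, a rough sketch of where the model file entry from the top of this thread might sit inside the copied HmmTaggerDE.xml. Only the fileResourceSpecifier / fileUrl part is taken from the message. The enclosing resourceManagerConfiguration, externalResources and externalResource elements are ordinary UIMA descriptor elements, but the resource name "ModelFile" is an assumption borrowed from the wording of the open issue, and everything else in the real descriptor is left out. Compare against the HmmTagger.xml shipped in Tagger-2.3.1.jar before copying anything.

```xml
<!-- Fragment of HmmTaggerDE.xml (a copy of HmmTagger.xml); not a complete descriptor. -->
<resourceManagerConfiguration>
  <externalResources>
    <externalResource>
      <!-- Assumed resource name; keep whatever name the original descriptor uses. -->
      <name>ModelFile</name>
      <fileResourceSpecifier>
        <!-- From the message: point the tagger at the German model file. -->
        <fileUrl>file:german/TuebaModel.dat</fileUrl>
      </fileResourceSpecifier>
      <!-- The implementationName element of the original descriptor stays unchanged. -->
    </externalResource>
  </externalResources>
  <!-- Any externalResourceBindings of the original descriptor stay unchanged. -->
</resourceManagerConfiguration>
```

The aggregate descriptor (AggregateSentenceDEAE.xml) then has to reference HmmTaggerDE instead of HmmTagger, as described in the reply.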