Thanks Kai for your feedback, I'll look into it and let you know.

Regards,
Tommaso

2013/2/1 Kai Gülzau <kguel...@novomind.com>

> I now use the "stupid" way to use the German corpus for UIMA: copy + paste :-)
>
> I modified the Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus
> ...
> <fileResourceSpecifier>
>   <fileUrl>file:german/TuebaModel.dat</fileUrl>
> </fileResourceSpecifier>
> ...
> and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml
>
> Next step is to replace every occurrence of "HmmTagger" in
> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
> with "HmmTaggerDE" and save it as
> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml
>
> This can be used in your schema.xml:
> <fieldType name="uima_nouns_de" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>       descriptorPath="/uima/AggregateSentenceDEAE.xml"
>       tokenType="org.apache.uima.TokenAnnotation" featurePath="posTag"/>
>     <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
>       types="/uima/whitelist_de.txt" />
>   </analyzer>
> </fieldType>
>
> There should be a way to accomplish this via config though.
>
>
> Last open issue: Performance!
>
> First run via Admin GUI analyze index value "Klaus geht in das Haus und
> sieht eine Maus." / query: "": ~ 5 seconds
> Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize
> Information: "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer finished processing"
> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize
> Information: "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer finished processing"
> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize
> Information: "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer finished processing"
>
> Second run via Admin GUI analyze "Klaus geht in das Haus und sieht eine
> Maus." / query: "": ~ 4 seconds
> Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize
> Information: "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer finished processing"
> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize
> Information: "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer finished processing"
> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize
> Information: "Whitespace tokenizer successfully initialized"
> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit
> Information: "Whitespace tokenizer typesystem initialized"
> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer starts processing"
> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
> Information: "Whitespace tokenizer finished processing"
>
> Initialized 3 times?
> I think some of the components are not reused while analyzing.
>
> Is this a known issue?
>
>
> Regards,
>
> Kai Gülzau
>
>
> -----Original Message-----
> From: Kai Gülzau [mailto:kguel...@novomind.com]
> Sent: Thursday, January 31, 2013 6:48 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing nouns only - UIMA vs. OpenNLP
>
> UIMA:
>
> I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
> Now I am able to use this analyzer for English texts and filter (un)wanted
> token types :-)
>
> <fieldType name="uima_nouns_en" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>       descriptorPath="/uima/AggregateSentenceAE.xml"
>       tokenType="org.apache.uima.TokenAnnotation"
>       featurePath="posTag"/>
>     <filter class="solr.TypeTokenFilterFactory"
>       types="/uima/stoptypes.txt" />
>   </analyzer>
> </fieldType>
>
> Open issue -> How to set the ModelFile for the Tagger to
> "german/TuebaModel.dat" ???
>
>
> Kai Gülzau
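
The copy + paste step described in the quoted message (copy a descriptor inside a jar under a new name and replace every occurrence of one string) could also be scripted. The following is only a minimal sketch and not something from the thread itself: the function name `derive_descriptor` is hypothetical, and it assumes the descriptors are UTF-8 text and that modifying the jar in place is acceptable.

```python
import zipfile


def derive_descriptor(jar_path, src_entry, dst_entry, old, new):
    """Add dst_entry to the jar as a copy of src_entry with old replaced by new.

    Hypothetical helper: returns the number of occurrences that were replaced.
    The original entry is left untouched.
    """
    # Read the existing descriptor (assumed to be UTF-8 encoded XML).
    with zipfile.ZipFile(jar_path, "r") as jar:
        if dst_entry in jar.namelist():
            raise ValueError(f"{dst_entry} already exists in {jar_path}")
        text = jar.read(src_entry).decode("utf-8")

    count = text.count(old)

    # Append the modified copy as a new entry in the same archive.
    with zipfile.ZipFile(jar_path, "a") as jar:
        jar.writestr(dst_entry, text.replace(old, new))

    return count


# Example call for the second step from the message (paths as given there):
# derive_descriptor(
#     "lucene-analyzers-uima-4.1.0.jar",
#     "uima/AggregateSentenceAE.xml",
#     "uima/AggregateSentenceDEAE.xml",
#     "HmmTagger",
#     "HmmTaggerDE",
# )
```

This only automates the manual workaround. It does not answer the open question of whether the model file can be selected purely via configuration.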