Regarding configuration parameters, have a look at
https://issues.apache.org/jira/browse/LUCENE-4749

Regards,
Tommaso

2013/2/4 Tommaso Teofili <tommaso.teof...@gmail.com>

> Thanks Kai for your feedback, I'll look into it and let you know.
> Regards,
> Tommaso
>
>
> 2013/2/1 Kai Gülzau <kguel...@novomind.com>
>
>> I now use the "stupid" way to use the German corpus for UIMA: copy + paste :-)
>>
>> I modified the Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus
>> ...
>> <fileResourceSpecifier>
>>   <fileUrl>file:german/TuebaModel.dat</fileUrl>
>> </fileResourceSpecifier>
>> ...
>> and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml
>>
>> Next step is to replace every occurrence of "HmmTagger" in
>> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
>> with "HmmTaggerDE" and save it as
>> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml
>>
>> This can be used in your schema.xml:
>> <fieldType name="uima_nouns_de" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>>       descriptorPath="/uima/AggregateSentenceDEAE.xml"
>>       tokenType="org.apache.uima.TokenAnnotation" featurePath="posTag"/>
>>     <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
>>       types="/uima/whitelist_de.txt" />
>>   </analyzer>
>> </fieldType>
>>
>> There should be a way to accomplish this via config though.
>>
>> Last open issue: Performance!
>>
>> First run via Admin GUI analyze index value "Klaus geht in das Haus und
>> sieht eine Maus." / query: "": ~ 5 seconds
>> Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize
>> Information: "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer finished processing"
>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize
>> Information: "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer finished processing"
>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize
>> Information: "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer finished processing"
>>
>> Second run via Admin GUI analyze "Klaus geht in das Haus und sieht eine
>> Maus." / query: "": ~ 4 seconds
>> Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize
>> Information: "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer finished processing"
>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize
>> Information: "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer finished processing"
>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize
>> Information: "Whitespace tokenizer successfully initialized"
>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit
>> Information: "Whitespace tokenizer typesystem initialized"
>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer starts processing"
>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>> Information: "Whitespace tokenizer finished processing"
>>
>> Initialized 3 times?
>> I think some of the components are not reused while analyzing.
>>
>> Is this a known issue?
>>
>> Regards,
>>
>> Kai Gülzau
>>
>> -----Original Message-----
>> From: Kai Gülzau [mailto:kguel...@novomind.com]
>> Sent: Thursday, January 31, 2013 6:48 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Indexing nouns only - UIMA vs. OpenNLP
>>
>> UIMA:
>>
>> I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
>> Now I am able to use this analyzer for English texts and filter
>> (un)wanted token types :-)
>>
>> <fieldType name="uima_nouns_en" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>>       descriptorPath="/uima/AggregateSentenceAE.xml"
>>       tokenType="org.apache.uima.TokenAnnotation"
>>       featurePath="posTag"/>
>>     <filter class="solr.TypeTokenFilterFactory"
>>       types="/uima/stoptypes.txt" />
>>   </analyzer>
>> </fieldType>
>>
>> Open issue -> How to set the ModelFile for the Tagger to
>> "german/TuebaModel.dat" ???
>>
>> Kai Gülzau
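
For anyone repeating the copy + paste step above, the rename of "HmmTagger" to "HmmTaggerDE" in the aggregate descriptor could be scripted rather than done by hand. The following is only a minimal sketch: the class name DescriptorRename, the method toGerman, and the XML fragment in main are made up for illustration and are not taken from the real AggregateSentenceAE.xml. It works on the descriptor text as a plain string; reading the file out of the jar and writing AggregateSentenceDEAE.xml back is left out.

```java
import java.util.regex.Pattern;

public class DescriptorRename {
    // Matches "HmmTagger" unless it is already followed by "DE",
    // so running the rename twice does not yield "HmmTaggerDEDE".
    private static final Pattern TAGGER = Pattern.compile("HmmTagger(?!DE)");

    /** Returns the descriptor text with every "HmmTagger" renamed to "HmmTaggerDE". */
    public static String toGerman(String descriptorXml) {
        return TAGGER.matcher(descriptorXml).replaceAll("HmmTaggerDE");
    }

    public static void main(String[] args) {
        // Hypothetical fragment, NOT the real contents of AggregateSentenceAE.xml.
        String sample = "<delegateAnalysisEngine key=\"HmmTagger\">"
                + "<import name=\"HmmTagger\"/></delegateAnalysisEngine>";
        System.out.println(toGerman(sample));
    }
}
```

The negative lookahead is a design choice to keep the rename safe to re-run; a plain String.replace would do the same job on a descriptor that has not been modified yet.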