See an example at
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/contrib/uima/src/test-files/uima/uima-tokenizers-schema.xml?view=diff&r1=1442116&r2=1442117&pathrev=1442117
where the 'ngramsize' parameter is set; it is defined in the
AggregateSentenceAE.xml descriptor and is then set with the given actual value.

HTH,
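Concretely, the mechanism from LUCENE-4749 lets extra attributes on the tokenizer element be passed through to the UIMA AnalysisEngine as configuration parameter values. A sketch of what that could look like in schema.xml (the field type name and the ngramsize value here are illustrative, not taken from the linked test file):

```xml
<!-- Sketch: an extra attribute on the tokenizer (here "ngramsize") is
     handed to the UIMA AnalysisEngine as a configuration parameter value.
     The parameter itself must be declared in the referenced descriptor
     (AggregateSentenceAE.xml in the linked example). -->
<fieldType name="uima_ngrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
               descriptorPath="/uima/AggregateSentenceAE.xml"
               tokenType="org.apache.uima.TokenAnnotation"
               featurePath="posTag"
               ngramsize="2"/>
  </analyzer>
</fieldType>
```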
Tommaso

2013/2/4 Tommaso Teofili <tommaso.teof...@gmail.com>

> Regarding configuration parameters, have a look at
> https://issues.apache.org/jira/browse/LUCENE-4749
>
> Regards,
> Tommaso
>
>
> 2013/2/4 Tommaso Teofili <tommaso.teof...@gmail.com>
>
>> Thanks Kai for your feedback, I'll look into it and let you know.
>>
>> Regards,
>> Tommaso
>>
>>
>> 2013/2/1 Kai Gülzau <kguel...@novomind.com>
>>
>>> I now use the "stupid" way to use the German corpus for UIMA: copy +
>>> paste :-)
>>>
>>> I modified Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus:
>>>
>>> ...
>>> <fileResourceSpecifier>
>>>   <fileUrl>file:german/TuebaModel.dat</fileUrl>
>>> </fileResourceSpecifier>
>>> ...
>>>
>>> and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml.
>>>
>>> The next step is to replace every occurrence of "HmmTagger" in
>>> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
>>> with "HmmTaggerDE" and save it as
>>> lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml.
>>>
>>> This can then be used in your schema.xml:
>>>
>>> <fieldType name="uima_nouns_de" class="solr.TextField"
>>>            positionIncrementGap="100">
>>>   <analyzer>
>>>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>>>                descriptorPath="/uima/AggregateSentenceDEAE.xml"
>>>                tokenType="org.apache.uima.TokenAnnotation"
>>>                featurePath="posTag"/>
>>>     <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
>>>             types="/uima/whitelist_de.txt"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> There should be a way to accomplish this via configuration, though.
>>>
>>>
>>> Last open issue: performance!
>>>
>>> First run via the Admin GUI, analyzing index value "Klaus geht in das
>>> Haus und sieht eine Maus."
>>> / query "": ~5 seconds
>>>
>>> Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize
>>> Information: "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer finished processing"
>>> Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize
>>> Information: "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer finished processing"
>>> Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize
>>> Information: "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer finished processing"
>>>
>>> Second run via the Admin GUI, analyzing "Klaus geht in das Haus und
>>> sieht eine Maus."
>>> / query "": ~4 seconds
>>>
>>> Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize
>>> Information: "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer finished processing"
>>> Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize
>>> Information: "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer finished processing"
>>> Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize
>>> Information: "Whitespace tokenizer successfully initialized"
>>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit
>>> Information: "Whitespace tokenizer typesystem initialized"
>>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer starts processing"
>>> Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
>>> Information: "Whitespace tokenizer finished processing"
>>>
>>> Initialized 3 times?
>>> I think some of the components are not reused while analyzing.
>>>
>>> Is this a known issue?
>>>
>>>
>>> Regards,
>>>
>>> Kai Gülzau
>>>
>>>
>>> -----Original Message-----
>>> From: Kai Gülzau [mailto:kguel...@novomind.com]
>>> Sent: Thursday, January 31, 2013 6:48 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: RE: Indexing nouns only - UIMA vs. OpenNLP
>>>
>>> UIMA:
>>>
>>> I just found this issue: https://issues.apache.org/jira/browse/SOLR-3013
>>> Now I am able to use this analyzer for English texts and filter
>>> (un)wanted token types :-)
>>>
>>> <fieldType name="uima_nouns_en" class="solr.TextField"
>>>            positionIncrementGap="100">
>>>   <analyzer>
>>>     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
>>>                descriptorPath="/uima/AggregateSentenceAE.xml"
>>>                tokenType="org.apache.uima.TokenAnnotation"
>>>                featurePath="posTag"/>
>>>     <filter class="solr.TypeTokenFilterFactory"
>>>             types="/uima/stoptypes.txt"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> Open issue -> how to set the ModelFile for the tagger to
>>> "german/TuebaModel.dat"?
>>>
>>>
>>> Kai Gülzau
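With the LUCENE-4749 parameter passthrough mentioned earlier in the thread, such settings could in principle be overridden without repackaging any jar. Failing that, the manual route is a local copy of the descriptor with the model resource repointed. A sketch of the relevant fragment (element names follow the generic UIMA resource-manager configuration schema and are not verified against the actual HmmTagger.xml; only the fileUrl value comes from the mails above):

```xml
<!-- Sketch, not a complete descriptor: only the external resource entry
     is shown, with fileUrl pointing at the German model. Everything else
     would stay as shipped in Tagger-2.3.1.jar's HmmTagger.xml. -->
<resourceManagerConfiguration>
  <externalResources>
    <externalResource>
      <name>ModelFile</name>
      <fileResourceSpecifier>
        <fileUrl>file:german/TuebaModel.dat</fileUrl>
      </fileResourceSpecifier>
    </externalResource>
  </externalResources>
  <!-- resource bindings unchanged from the shipped descriptor -->
</resourceManagerConfiguration>
```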