RE: Indexing nouns only with UIMA works - performance issue?
So with https://issues.apache.org/jira/browse/LUCENE-4749 it's possible to set the ModelFile?

  <tokenizer class="solr.UIMAAnnotationsTokenizerFactory"
             descriptorPath="/uima/AggregateSentenceAE.xml"
             tokenType="org.apache.uima.SentenceAnnotation"
             ngramsize="2"
             modelFile="file:german/TuebaModel.dat" />

???

Thanks,
Kai

-----Original Message-----
From: Tommaso Teofili [mailto:tommaso.teof...@gmail.com]
Sent: Monday, February 04, 2013 2:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing nouns only with UIMA works - performance issue?

see an example at http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/contrib/uima/src/test-files/uima/uima-tokenizers-schema.xml?view=diff&r1=1442116&r2=1442117&pathrev=1442117 where the 'ngramsize' parameter is set, that's defined in AggregateSentenceAE.xml descriptor and is then set with the given actual value.

HTH,
Tommaso
Re: Indexing nouns only with UIMA works - performance issue?
right, that should be possible (if using trunk or branch_4x, which will be 4.2).

Tommaso

2013/2/5 Kai Gülzau kguel...@novomind.com:

  So with https://issues.apache.org/jira/browse/LUCENE-4749 it's possible to set the ModelFile?
  [...]
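For readers still on 4.1.0, the copy-and-paste workaround Kai describes elsewhere in this thread (adding patched copies of the descriptors to the jars) could be scripted. The following is only a minimal sketch: the entry names HmmTagger.xml, HmmTaggerDE.xml, uima/AggregateSentenceAE.xml and uima/AggregateSentenceDEAE.xml come from Kai's message, while the helper names and the assumption that the model location is the only fileUrl element in HmmTagger.xml are hypothetical.

```python
import io
import re
import zipfile


def add_derived_entry(jar_bytes, src_name, dst_name, transform):
    """Return a copy of the jar that additionally contains dst_name,
    produced by applying transform() to the text of src_name.
    All existing entries are copied unchanged."""
    src = zipfile.ZipFile(io.BytesIO(jar_bytes))
    if dst_name in src.namelist():
        raise ValueError(dst_name + " already exists in the jar")
    out_buf = io.BytesIO()
    with zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as out:
        for info in src.infolist():
            out.writestr(info, src.read(info.filename))
        patched = transform(src.read(src_name).decode("utf-8"))
        out.writestr(dst_name, patched)
    return out_buf.getvalue()


def use_german_model(xml):
    # Assumption: every <fileUrl> in the tagger descriptor points at the model.
    return re.sub(r"<fileUrl>[^<]*</fileUrl>",
                  "<fileUrl>file:german/TuebaModel.dat</fileUrl>", xml)


def use_german_tagger(xml):
    # Kai's second step: replace every occurrence of HmmTagger.
    return xml.replace("HmmTagger", "HmmTaggerDE")


def patch_jar_file(path, src_name, dst_name, transform):
    """Rewrite the jar at `path` in place (hypothetical convenience wrapper)."""
    with open(path, "rb") as fh:
        original = fh.read()
    patched = add_derived_entry(original, src_name, dst_name, transform)
    with open(path, "wb") as fh:
        fh.write(patched)
```

Calling patch_jar_file("Tagger-2.3.1.jar", "HmmTagger.xml", "HmmTaggerDE.xml", use_german_model) and then patch_jar_file("lucene-analyzers-uima-4.1.0.jar", "uima/AggregateSentenceAE.xml", "uima/AggregateSentenceDEAE.xml", use_german_tagger) would mirror the two manual steps. Work on copies of the jars, since the files are overwritten.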
Re: Indexing nouns only with UIMA works - performance issue?
Thanks Kai for your feedback, I'll look into it and let you know.

Regards,
Tommaso

2013/2/1 Kai Gülzau kguel...@novomind.com:

I now use the stupid way to use the German corpus for UIMA: copy + paste :-)

I modified Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus

  ...
  <fileResourceSpecifier>
    <fileUrl>file:german/TuebaModel.dat</fileUrl>
  </fileResourceSpecifier>
  ...

and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml

The next step is to replace every occurrence of HmmTagger in lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml with HmmTaggerDE and save it as lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml

This can be used in your schema.xml:

  <fieldType name="uima_nouns_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
                 descriptorPath="/uima/AggregateSentenceDEAE.xml"
                 tokenType="org.apache.uima.TokenAnnotation"
                 featurePath="posTag"/>
      <filter class="solr.TypeTokenFilterFactory" useWhitelist="true" types="/uima/whitelist_de.txt" />
    </analyzer>
  </fieldType>

There should be a way to accomplish this via config though.

Last open issue: Performance!

First run via Admin GUI analyze, index value "Klaus geht in das Haus und sieht eine Maus." / query "": ~ 5 seconds

  Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information: Whitespace tokenizer successfully initialized
  Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit Information: Whitespace tokenizer typesystem initialized
  Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process Information: Whitespace tokenizer starts processing
  Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process Information: Whitespace tokenizer finished processing
  Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information: Whitespace tokenizer successfully initialized
  Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit Information: Whitespace tokenizer typesystem initialized
  Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process Information: Whitespace tokenizer starts processing
  Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process Information: Whitespace tokenizer finished processing
  Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information: Whitespace tokenizer successfully initialized
  Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit Information: Whitespace tokenizer typesystem initialized
  Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process Information: Whitespace tokenizer starts processing
  Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process Information: Whitespace tokenizer finished processing

Second run via Admin GUI analyze, index value "Klaus geht in das Haus und sieht eine Maus." / query "": ~ 4 seconds

  Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize Information: Whitespace tokenizer successfully initialized
  Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit Information: Whitespace tokenizer typesystem initialized
  Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process Information: Whitespace tokenizer starts processing
  Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process Information: Whitespace tokenizer finished processing
  Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize Information: Whitespace tokenizer successfully initialized
  Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit Information: Whitespace tokenizer typesystem initialized
  Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process Information: Whitespace tokenizer starts processing
  Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process Information: Whitespace tokenizer finished processing
  Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize Information: Whitespace tokenizer successfully initialized
  Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit Information: Whitespace tokenizer typesystem initialized
  Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process Information: Whitespace tokenizer starts processing
  Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process Information: Whitespace tokenizer finished processing

Initialized 3 times? I think some of the components are not reused while analyzing. Is this a known issue?

Regards,
Kai Gülzau

-----Original Message-----
From: Kai Gülzau [mailto:kguel...@novomind.com]
Sent: Thursday, January 31, 2013 6:48 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing nouns only - UIMA vs. OpenNLP

UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted token types :-)

  <fieldType name="uima_nouns_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
                 descriptorPath="/uima/AggregateSentenceAE.xml"
                 tokenType="org.apache.uima.TokenAnnotation"
                 featurePath="posTag"/>
      <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt" />
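Kai's observation that the tokenizer is initialized three times per analysis request can be checked mechanically against log excerpts like the ones quoted in this message. This is a small sketch, assuming the java.util.logging layout seen in the thread (timestamp, source class, method name, then the message); the function name summarize and the keys of its result are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp layout as it appears in the quoted log, e.g. "Feb 01, 2013 11:01:00 AM".
TIMESTAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"

# One log entry: timestamp, logging source, method name.
LOG_ENTRY = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} \d{2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s+"
    r"(?P<source>\S+)\s+(?P<method>\S+)"
)


def summarize(log_text):
    """Count log entries and 'initialize' calls, and measure the time
    between the first and the last entry in seconds."""
    entries = []
    for match in LOG_ENTRY.finditer(log_text):
        when = datetime.strptime(match.group("ts"), TIMESTAMP_FORMAT)
        entries.append((when, match.group("source"), match.group("method")))

    methods = Counter(method for _, _, method in entries)
    if entries:
        elapsed = (entries[-1][0] - entries[0][0]).total_seconds()
    else:
        elapsed = 0.0

    return {
        "entries": len(entries),
        "initializations": methods["initialize"],
        "elapsed_seconds": elapsed,
    }
```

Fed the twelve entries of the first run, this reports three initialize calls over five seconds (11:01:00 to 11:01:05), which matches the "~ 5 seconds" Kai measured and suggests that most of the time goes into repeated engine setup rather than into the process calls themselves.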
Re: Indexing nouns only with UIMA works - performance issue?
Regarding configuration parameters have a look at https://issues.apache.org/jira/browse/LUCENE-4749

Regards,
Tommaso

2013/2/4 Tommaso Teofili tommaso.teof...@gmail.com:

  Thanks Kai for your feedback, I'll look into it and let you know.
  [...]
Re: Indexing nouns only with UIMA works - performance issue?
see an example at http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/contrib/uima/src/test-files/uima/uima-tokenizers-schema.xml?view=diff&r1=1442116&r2=1442117&pathrev=1442117 where the 'ngramsize' parameter is set, that's defined in AggregateSentenceAE.xml descriptor and is then set with the given actual value.

HTH,
Tommaso

2013/2/4 Tommaso Teofili tommaso.teof...@gmail.com:

  Regarding configuration parameters have a look at https://issues.apache.org/jira/browse/LUCENE-4749
  [...]
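Tommaso's description (a parameter that is defined in the descriptor is set with the value given on the tokenizer element) can be pictured with a rough model. This is not the Lucene implementation: the set of attributes treated as the factory's own arguments and the function name are assumptions made for illustration, and the only parameter the thread confirms as declared in AggregateSentenceAE.xml is 'ngramsize'.

```python
import xml.etree.ElementTree as ET

# Assumption: attributes the factory consumes itself; everything else is
# treated as a candidate override for a descriptor configuration parameter.
FACTORY_ARGS = {"class", "descriptorPath", "tokenType"}


def split_tokenizer_attributes(tokenizer_xml, declared_params):
    """Split the attributes of a <tokenizer .../> element into the factory's
    own arguments and descriptor parameter overrides. Also list overrides
    whose names are not among the parameters the descriptor declares."""
    attrs = ET.fromstring(tokenizer_xml).attrib
    factory_args = {k: v for k, v in attrs.items() if k in FACTORY_ARGS}
    overrides = {k: v for k, v in attrs.items() if k not in FACTORY_ARGS}
    undeclared = sorted(set(overrides) - set(declared_params))
    return factory_args, overrides, undeclared
```

Applied to the element from Kai's question with declared_params={"ngramsize"}, the model forwards ngramsize and modelFile as overrides and flags modelFile as undeclared. Under this reading, modelFile would only take effect if the descriptor in use actually declares a parameter of that name, which the thread does not confirm.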