RE: Indexing nouns only with UIMA works - performance issue?

2013-02-05 Thread Kai Gülzau
So with https://issues.apache.org/jira/browse/LUCENE-4749 it's possible to set 
the ModelFile?

<tokenizer class="solr.UIMAAnnotationsTokenizerFactory"
descriptorPath="/uima/AggregateSentenceAE.xml"
tokenType="org.apache.uima.SentenceAnnotation" ngramsize="2"
modelFile="file:german/TuebaModel.dat" />

???

Thanks,

Kai 


-----Original Message-----
From: Tommaso Teofili [mailto:tommaso.teof...@gmail.com] 
Sent: Monday, February 04, 2013 2:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing nouns only with UIMA works - performance issue?

see an example at
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/contrib/uima/src/test-files/uima/uima-tokenizers-schema.xml?view=diff&r1=1442116&r2=1442117&pathrev=1442117
where the 'ngramsize' parameter is set; it is defined in the
AggregateSentenceAE.xml descriptor and is then set with the given actual
value.
HTH,

Tommaso


Re: Indexing nouns only with UIMA works - performance issue?

2013-02-05 Thread Tommaso Teofili
Right, that should be possible (if using trunk or branch_4x, which will be
4.2).

Tommaso





Re: Indexing nouns only with UIMA works - performance issue?

2013-02-04 Thread Tommaso Teofili
Thanks Kai for your feedback, I'll look into it and let you know.
Regards,
Tommaso


2013/2/1 Kai Gülzau kguel...@novomind.com

 I now use the stupid way to use the German corpus for UIMA: copy + paste
 :-)

 I modified the Tagger-2.3.1.jar/HmmTagger.xml to use the German corpus
 ...
 <fileResourceSpecifier>
   <fileUrl>file:german/TuebaModel.dat</fileUrl>
 </fileResourceSpecifier>
 ...
 and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml


 Next step is to replace every occurrence of HmmTagger in
 lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
 with HmmTaggerDE and save it as
 lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml
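
The copy-and-rename step described above can be sketched in plain Java. This is only an illustration, not something posted in the thread: it assumes the descriptor has already been extracted from the jar to a regular file, that it is UTF-8 encoded, and that a literal text replacement is sufficient. The class and method names (DescriptorCopy, renameDelegate, copyWithRename) are hypothetical, and the XML fragment in main is made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch: derive a language-specific descriptor from an extracted copy of the default one. */
class DescriptorCopy {

    /** Replaces every literal occurrence of oldName with newName in the descriptor text. */
    static String renameDelegate(String descriptorXml, String oldName, String newName) {
        return descriptorXml.replace(oldName, newName);
    }

    /** Reads source, applies the rename, and writes the result to target (UTF-8 is an assumption). */
    static void copyWithRename(Path source, Path target, String oldName, String newName)
            throws IOException {
        String xml = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        String renamed = renameDelegate(xml, oldName, newName);
        Files.write(target, renamed.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Made-up fragment for illustration; the real descriptor content is not shown in this thread.
        String fragment = "<node>HmmTagger</node>";
        System.out.println(renameDelegate(fragment, "HmmTagger", "HmmTaggerDE"));
    }
}
```

Because the replacement is literal, running it twice on the same text would turn HmmTaggerDE into HmmTaggerDEDE, so the sketch is meant to be applied once to a fresh copy of the original descriptor.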

 This can be used in your schema.xml:
 <fieldType name="uima_nouns_de" class="solr.TextField"
 positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
       descriptorPath="/uima/AggregateSentenceDEAE.xml"
       tokenType="org.apache.uima.TokenAnnotation" featurePath="posTag"/>
     <filter class="solr.TypeTokenFilterFactory" useWhitelist="true"
       types="/uima/whitelist_de.txt" />
   </analyzer>
 </fieldType>
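
To make the effect of the whitelist in this field type concrete, here is a small stand-alone Java sketch of the idea: each token carries a type (its part-of-speech tag), and only tokens whose type is listed are kept. It does not use the Lucene or Solr classes. The names (TypeWhitelistSketch, TypedToken, keepWhitelisted) are hypothetical, and the STTS-style tags on the sample sentence are assigned by hand for illustration; the tags the German model really emits, and the actual contents of whitelist_de.txt, are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of whitelist-based filtering by token type; not the Solr implementation. */
class TypeWhitelistSketch {

    /** A token's surface text plus the type attached to it (here: a part-of-speech tag). */
    static final class TypedToken {
        final String text;
        final String type;

        TypedToken(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** Returns the text of every token whose type appears in the whitelist, in input order. */
    static List<String> keepWhitelisted(List<TypedToken> tokens, Set<String> whitelist) {
        List<String> kept = new ArrayList<>();
        for (TypedToken token : tokens) {
            if (whitelist.contains(token.type)) {
                kept.add(token.text);
            }
        }
        return kept;
    }

    /** The sample sentence from this thread with hand-assigned, illustrative tags. */
    static List<TypedToken> sampleSentence() {
        return Arrays.asList(
                new TypedToken("Klaus", "NE"), new TypedToken("geht", "VVFIN"),
                new TypedToken("in", "APPR"), new TypedToken("das", "ART"),
                new TypedToken("Haus", "NN"), new TypedToken("und", "KON"),
                new TypedToken("sieht", "VVFIN"), new TypedToken("eine", "ART"),
                new TypedToken("Maus", "NN"), new TypedToken(".", "$."));
    }

    public static void main(String[] args) {
        // Assumed whitelist: common nouns (NN) and proper nouns (NE).
        Set<String> nounTypes = new HashSet<>(Arrays.asList("NN", "NE"));
        System.out.println(keepWhitelisted(sampleSentence(), nounTypes)); // prints [Klaus, Haus, Maus]
    }
}
```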

 There should be a way to accomplish this via config though.



 Last open issue: Performance!

 First run via Admin GUI analyze (index value: "Klaus geht in das Haus und
 sieht eine Maus." / query: ""): ~ 5 seconds
 Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize
 Information: Whitespace tokenizer successfully initialized
 Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit
 Information: Whitespace tokenizer typesystem initialized
 Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer starts processing
 Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer finished processing
 Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize
 Information: Whitespace tokenizer successfully initialized
 Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit
 Information: Whitespace tokenizer typesystem initialized
 Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer starts processing
 Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer finished processing
 Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize
 Information: Whitespace tokenizer successfully initialized
 Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit
 Information: Whitespace tokenizer typesystem initialized
 Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer starts processing
 Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer finished processing

 Second run via Admin GUI analyze ("Klaus geht in das Haus und sieht eine
 Maus." / query: ""): ~ 4 seconds
 Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize
 Information: Whitespace tokenizer successfully initialized
 Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit
 Information: Whitespace tokenizer typesystem initialized
 Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer starts processing
 Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer finished processing
 Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize
 Information: Whitespace tokenizer successfully initialized
 Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit
 Information: Whitespace tokenizer typesystem initialized
 Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer starts processing
 Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer finished processing
 Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize
 Information: Whitespace tokenizer successfully initialized
 Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit
 Information: Whitespace tokenizer typesystem initialized
 Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer starts processing
 Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process
 Information: Whitespace tokenizer finished processing

 Initialized 3 times?
 I think some of the components are not reused while analyzing.

 Is this a known issue?
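
The reuse that the question above asks about can be pictured with a generic cache keyed by descriptor path. The sketch below is purely illustrative and assumes nothing about how the Lucene UIMA tokenizers are actually implemented: ExpensiveEngine is a hypothetical stand-in for a component that is costly to initialize, and EngineCacheSketch, engineFor and initializationCount are made-up names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: build a costly component once per descriptor path and hand out the same instance. */
class EngineCacheSketch {

    /** Hypothetical stand-in for something that is slow to initialize. */
    static final class ExpensiveEngine {
        final String descriptorPath;

        ExpensiveEngine(String descriptorPath) {
            this.descriptorPath = descriptorPath;
        }
    }

    private final ConcurrentMap<String, ExpensiveEngine> engines = new ConcurrentHashMap<>();
    private final AtomicInteger initializations = new AtomicInteger();

    /** Returns the cached engine for the path, creating it only on the first request. */
    ExpensiveEngine engineFor(String descriptorPath) {
        return engines.computeIfAbsent(descriptorPath, path -> {
            initializations.incrementAndGet(); // counts how often the costly setup really runs
            return new ExpensiveEngine(path);
        });
    }

    /** Number of engines that have been constructed so far. */
    int initializationCount() {
        return initializations.get();
    }

    public static void main(String[] args) {
        EngineCacheSketch cache = new EngineCacheSketch();
        for (int i = 0; i < 3; i++) {
            cache.engineFor("/uima/AggregateSentenceDEAE.xml");
        }
        System.out.println(cache.initializationCount()); // prints 1
    }
}
```

Whether a shared instance could safely be used by several analysis requests at the same time is a separate question that this sketch does not address.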


 Regards,

 Kai Gülzau



 -----Original Message-----
 From: Kai Gülzau [mailto:kguel...@novomind.com]
 Sent: Thursday, January 31, 2013 6:48 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing nouns only - UIMA vs. OpenNLP

 UIMA:

 I just found this issue: https://issues.apache.org/jira/browse/SOLR-3013
 Now I am able to use this analyzer for English texts and filter (un)wanted
 token types :-)

 <fieldType name="uima_nouns_en" class="solr.TextField"
 positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
       descriptorPath="/uima/AggregateSentenceAE.xml"
       tokenType="org.apache.uima.TokenAnnotation"
       featurePath="posTag"/>
     <filter class="solr.TypeTokenFilterFactory"
       types="/uima/stoptypes.txt" />
   

Re: Indexing nouns only with UIMA works - performance issue?

2013-02-04 Thread Tommaso Teofili
Regarding configuration parameters, have a look at
https://issues.apache.org/jira/browse/LUCENE-4749
Regards,
Tommaso


Re: Indexing nouns only with UIMA works - performance issue?

2013-02-04 Thread Tommaso Teofili
see an example at
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/contrib/uima/src/test-files/uima/uima-tokenizers-schema.xml?view=diff&r1=1442116&r2=1442117&pathrev=1442117
where the 'ngramsize' parameter is set; it is defined in the
AggregateSentenceAE.xml descriptor and is then set with the given actual
value.
HTH,

Tommaso

