Thanks, Kai!

About removing non-nouns: the OpenNLP patch includes two simple TokenFilters for manipulating terms with payloads. The FilterPayloadsFilter lets you keep or remove terms with given payloads. In the demo schema.xml there is an example field type that keeps only nouns & verbs.

There is a "universal" mapping between the part-of-speech tagsets of different languages, but there is no Solr/Lucene support for it:
http://code.google.com/p/universal-pos-tags/
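The idea behind that project can be sketched in a few lines: collapse each language-specific tagset onto one shared set of coarse tags, so a single keep-list (say, NOUN and VERB) works for both English and German. The tag pairs below are a small hand-picked subset for illustration; the full per-tagset mappings live in the universal-pos-tags project itself.

```python
# Collapse language-specific POS tags onto the universal tagset.
# Small illustrative subsets only, not the complete mappings.
PTB_TO_UNIV = {   # Penn Treebank (English)
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBZ": "VERB", "JJ": "ADJ", "DT": "DET",
}
STTS_TO_UNIV = {  # STTS (German)
    "NN": "NOUN", "NE": "NOUN", "VVFIN": "VERB", "ADJA": "ADJ",
    "ART": "DET",
}

def to_universal(tag, mapping):
    # "X" is the universal tagset's catch-all for other/unknown tags
    return mapping.get(tag, "X")
```

With this, an English plural noun ("NNS") and a German proper noun ("NE") both come out as "NOUN", so one language-independent filter list suffices.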

On 01/31/2013 09:47 AM, Kai Gülzau wrote:
UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and to filter (un)wanted 
token types :-)

<fieldType name="uima_nouns_en" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
               descriptorPath="/uima/AggregateSentenceAE.xml"
               tokenType="org.apache.uima.TokenAnnotation"
               featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt"/>
  </analyzer>
</fieldType>
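For reference, the stop-types file is just one token type per line. A hypothetical example (the tag names below are Penn Treebank tags and are my assumption; the actual file contents depend on which tagset the analysis engine emits):

```
DT
IN
CC
PRP
```

With the factory's default settings the listed types are removed; to keep only the listed types instead, the factory also accepts a whitelist mode (check the TypeTokenFilterFactory options for your Solr version).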

Open issue: how do I point the tagger's ModelFile at 
"german/TuebaModel.dat"?



OpenNLP:

And a modified patch for https://issues.apache.org/jira/browse/LUCENE-2899 is 
now working with Solr 4.1 :-)

<fieldType name="nlp_nouns_de" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               tokenizerModel="opennlp/de-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory"
            posTaggerModel="opennlp/de-pos-maxent.bin"/>
    <filter class="solr.FilterPayloadsFilterFactory"
            payloadList="NN,NNS,NNP,NNPS,FM"
            keepPayloads="true"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
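The last two filters of this chain can be sketched in plain Python to show what they do: each token carries its POS tag as a payload, the first filter keeps only tokens whose payload is on the list, and the second drops the payload so plain terms reach the index. (This is an illustrative model of the behavior, not the patch's actual TokenStream code; the sample German tags are mine.)

```python
# Emulate FilterPayloadsFilter + StripPayloadsFilter on (term, payload) pairs.
KEEP = {"NN", "NNS", "NNP", "NNPS", "FM"}

def filter_payloads(tokens, payload_list=KEEP, keep=True):
    """keep=True retains listed payloads; keep=False removes them."""
    for term, payload in tokens:
        if (payload in payload_list) == keep:
            yield term, payload

def strip_payloads(tokens):
    # Drop the payload so only the bare term is indexed.
    for term, _payload in tokens:
        yield term

tagged = [("Haus", "NN"), ("ist", "VAFIN"), ("gross", "ADJD")]
nouns = list(strip_payloads(filter_payloads(tagged)))
# nouns == ["Haus"]
```

Flipping `keep` to False turns the same list into a blacklist, mirroring the factory's keepPayloads attribute.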



Any hints on which library is more accurate at noun tagging?
Any performance or memory issues? (I hit some OOMs here while testing with 
1 GB via the Analysis admin GUI.)


Regards,

Kai Gülzau




-----Original Message-----
From: Kai Gülzau [mailto:kguel...@novomind.com]
Sent: Thursday, January 31, 2013 2:19 PM
To: solr-user@lucene.apache.org
Subject: Indexing nouns only - UIMA vs. OpenNLP

Hi,

I am stuck trying to index only the nouns of German and English texts
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example).


First try was to use UIMA with the HMMTagger:

<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
  <lst name="uimaConfig">
    <lst name="runtimeParameters"></lst>
    <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
    <bool name="ignoreErrors">false</bool>
    <lst name="analyzeFields">
      <bool name="merge">false</bool>
      <arr name="fields"><str>albody</str></arr>
    </lst>
    <lst name="fieldMappings">
      <lst name="type">
        <str name="name">org.apache.uima.SentenceAnnotation</str>
        <lst name="mapping">
          <str name="feature">coveredText</str>
          <str name="field">albody2</str>
        </lst>
      </lst>
    </lst>
  </lst>
</processor>

- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?) or via the Solr contrib/langid field mapping?
- How do I remove non-nouns from the annotated field?
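The routing part of the language-identification question can at least be framed as a sketch: pick the tagger model per document based on a language field, e.g. one produced by Solr's contrib/langid. The German model path mirrors the one used elsewhere in this thread; the English path and the fallback-to-English policy are assumptions for illustration.

```python
# Route each document's text to a POS model by detected language.
POS_MODELS = {
    "de": "opennlp/de-pos-maxent.bin",
    "en": "opennlp/en-pos-maxent.bin",  # assumed English counterpart
}

def model_for(language, default="en"):
    # Fall back to the default model for unrecognized languages.
    return POS_MODELS.get(language, POS_MODELS[default])
```

Whether this selection should live inside a UIMA aggregate or be done by mapping langid output to per-language fields (each with its own analyzer chain) is exactly the open question above.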


Second try is to use OpenNLP and apply the patch from 
https://issues.apache.org/jira/browse/LUCENE-2899,
but the patch seems to be a bit out of date.
I am currently trying to get it working with Solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau

