Indexing nouns only - UIMA vs. OpenNLP

Kai Gülzau Thu, 31 Jan 2013 05:20:01 -0800

Hi,

I am stuck trying to index only the nouns of German and English texts.
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example)



My first attempt was to use UIMA with the HMMTagger:

<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
  <lst name="uimaConfig">
    <lst name="runtimeParameters"></lst>
    <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
    <bool name="ignoreErrors">false</bool>
    <lst name="analyzeFields">
      <bool name="merge">false</bool>
      <arr name="fields"><str>albody</str></arr>
    </lst>
    <lst name="fieldMappings">
      <lst name="type">
        <str name="name">org.apache.uima.SentenceAnnotation</str>
        <lst name="mapping">
          <str name="feature">coveredText</str>
          <str name="field">albody2</str>
        </lst>
      </lst>
    </lst>
  </lst>
</processor>

- How do I set the ModelFile so that the tagger uses the German corpus?
- What about language identification?
-- How do I select the right corpus/tagger based on the detected language?
-- Should this be done in UIMA (how?) or via the Solr contrib/langid field mapping?
- How do I remove non-nouns from the annotated field?
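
For the last question, one direction that might work is to do the filtering at analysis time instead of in the update processor: the Lucene analysis-uima module provides UIMATypeAwareAnnotationsTokenizerFactory, which exposes a feature value as the token type, and TypeTokenFilterFactory can then keep only whitelisted types. This is an untested sketch, not a confirmed solution. The field type name "text_nouns" and the file "nountags.txt" are hypothetical, and it assumes that the aggregate descriptor runs the tagger and sets a posTag feature on org.apache.uima.TokenAnnotation.

```xml
<!-- Hypothetical field type: emit UIMA tokens, keep only noun-tagged ones -->
<fieldType name="text_nouns" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Token type is taken from the posTag feature (assumed to be set by the tagger) -->
    <tokenizer class="org.apache.lucene.analysis.uima.UIMATypeAwareAnnotationsTokenizerFactory"
               descriptorPath="/org/apache/uima/desc/AggregateSentenceAE.xml"
               tokenType="org.apache.uima.TokenAnnotation"
               featurePath="posTag"/>
    <!-- nountags.txt (hypothetical): one noun tag per line, in the tagset of the model in use -->
    <filter class="solr.TypeTokenFilterFactory" types="nountags.txt" useWhitelist="true"/>
  </analyzer>
</fieldType>
```

The whitelist would have to match whatever tags the chosen model emits, so a German model and an English model would probably need separate tag files and field types.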


My second attempt is to use OpenNLP and apply the patch from
https://issues.apache.org/jira/browse/LUCENE-2899
However, the patch seems to be a bit out of date.
I am currently trying to get it to work with Solr 4.1.


Any pointers appreciated :-)

Regards,

Kai Gülzau
