Hi,
I am stuck trying to index only the nouns of German and English texts.
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example)
My first try was to use UIMA with the HMMTagger:
<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
  <lst name="uimaConfig">
    <lst name="runtimeParameters"></lst>
    <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
    <bool name="ignoreErrors">false</bool>
    <lst name="analyzeFields">
      <bool name="merge">false</bool>
      <arr name="fields"><str>albody</str></arr>
    </lst>
    <lst name="fieldMappings">
      <lst name="type">
        <str name="name">org.apache.uima.SentenceAnnotation</str>
        <lst name="mapping">
          <str name="feature">coveredText</str>
          <str name="field">albody2</str>
        </lst>
      </lst>
    </lst>
  </lst>
</processor>
- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I select the right corpus/tagger based on the detected language?
-- Should this be done in UIMA (how?) or via Solr's contrib/langid field mapping?
- How do I remove the non-nouns from the annotated field?
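For the contrib/langid option, a minimal sketch of the update chain could look roughly like the following. It is untested and assumes the LangDetect-based processor from contrib/langid is on the classpath. The chain name "langid" and the field name "language" are placeholders I made up, and the de/en whitelist and the "en" fallback are assumptions as well. With langid.map enabled, the content of albody should end up in albody_de or albody_en, so each language could get its own tagger or analysis chain.

```xml
<!-- Sketch only: detect the language of albody before any tagging happens. -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- field(s) to run language detection on -->
    <str name="langid.fl">albody</str>
    <!-- field that receives the detected language code (assumed name) -->
    <str name="langid.langField">language</str>
    <!-- only German and English are expected (assumption) -->
    <str name="langid.whitelist">de,en</str>
    <!-- language to use when detection fails (assumption) -->
    <str name="langid.fallback">en</str>
    <!-- map albody to albody_de / albody_en based on the detected language -->
    <bool name="langid.map">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

This only covers language detection and field mapping. It does not answer how to pick the German model in UIMA or how to drop the non-nouns.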
My second try is to use OpenNLP by applying the patch from
https://issues.apache.org/jira/browse/LUCENE-2899
But the patch seems to be a bit out of date.
I am currently trying to get it to work with Solr 4.1.
Any pointers appreciated :-)
Regards,
Kai Gülzau