Thanks, Kai!
About removing non-nouns: the OpenNLP patch includes two simple
TokenFilters for manipulating terms with payloads. The
FilterPayloadsFilter lets you keep or remove terms with given payloads.
The demo schema.xml contains an example field type that keeps only
nouns and verbs.
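A sketch of what such a type could look like, modeled on the German example further down in this thread; the English model file names and the Penn Treebank noun/verb payload list are my assumptions, not taken from the demo schema:

```xml
<!-- Hypothetical example: keep only nouns and verbs (Penn Treebank tags) -->
<fieldType name="nlp_nouns_verbs_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory"
            posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <!-- keep tokens whose POS payload is a noun or verb tag -->
    <filter class="solr.FilterPayloadsFilterFactory"
            payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ"
            keepPayloads="true"/>
    <!-- drop the payloads once filtering is done -->
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
```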
There is a "universal" mapping between the part-of-speech tagsets of
different languages, though there is no Solr/Lucene support for it yet:
http://code.google.com/p/universal-pos-tags/
On 01/31/2013 09:47 AM, Kai Gülzau wrote:
UIMA:
I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted
token types. :-)
<fieldType name="uima_nouns_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
               descriptorPath="/uima/AggregateSentenceAE.xml"
               tokenType="org.apache.uima.TokenAnnotation"
               featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt"/>
  </analyzer>
</fieldType>
Open issue: how do I set the tagger's ModelFile to
"german/TuebaModel.dat"?
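One possible route, sketched from the general UIMA descriptor mechanism rather than from this particular aggregate: an aggregate analysis engine descriptor can declare a parameter that overrides a delegate's parameter, so AggregateSentenceAE.xml might expose the tagger's ModelFile roughly like this ("Tagger" is an assumed delegate key; check the actual descriptor):

```xml
<!-- Sketch of a UIMA parameter override inside the aggregate descriptor.
     The delegate key "Tagger" is an assumption. -->
<configurationParameters>
  <configurationParameter>
    <name>ModelFile</name>
    <type>String</type>
    <overrides>
      <parameter>Tagger/ModelFile</parameter>
    </overrides>
  </configurationParameter>
</configurationParameters>
<configurationParameterSettings>
  <nameValuePair>
    <name>ModelFile</name>
    <value><string>german/TuebaModel.dat</string></value>
  </nameValuePair>
</configurationParameterSettings>
```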
OpenNLP:
And a modified patch for https://issues.apache.org/jira/browse/LUCENE-2899
now works with Solr 4.1. :-)
<fieldType name="nlp_nouns_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               tokenizerModel="opennlp/de-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory"
            posTaggerModel="opennlp/de-pos-maxent.bin"/>
    <filter class="solr.FilterPayloadsFilterFactory"
            payloadList="NN,NNS,NNP,NNPS,FM"
            keepPayloads="true"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
Any hints on which library is more accurate at noun tagging?
Any performance or memory issues? (I hit some OOMs here while testing with
1 GB via the Analysis admin GUI.)
Regards,
Kai Gülzau
-----Original Message-----
From: Kai Gülzau [mailto:[email protected]]
Sent: Thursday, January 31, 2013 2:19 PM
To: [email protected]
Subject: Indexing nouns only - UIMA vs. OpenNLP
Hi,
I am stuck trying to index only the nouns of German and English texts
(very similar to http://wiki.apache.org/solr/OpenNLP#Full_Example).
First try was to use UIMA with the HMMTagger:
<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
  <lst name="uimaConfig">
    <lst name="runtimeParameters"></lst>
    <str name="analysisEngine">/org/apache/uima/desc/AggregateSentenceAE.xml</str>
    <bool name="ignoreErrors">false</bool>
    <lst name="analyzeFields">
      <bool name="merge">false</bool>
      <arr name="fields"><str>albody</str></arr>
    </lst>
    <lst name="fieldMappings">
      <lst name="type">
        <str name="name">org.apache.uima.SentenceAnnotation</str>
        <lst name="mapping">
          <str name="feature">coveredText</str>
          <str name="field">albody2</str>
        </lst>
      </lst>
    </lst>
  </lst>
</processor>
- But how do I set the ModelFile to use the German corpus?
- What about language identification?
-- How do I use the right corpus/tagger based on the language?
-- Should this be done in UIMA (how?), or via the Solr contrib/langid field mapping?
- How do I remove non-nouns from the annotated field?
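For the language-identification part, a sketch using the Solr langid contrib: it detects the language of "albody" and, with langid.map enabled, moves the content to albody_en / albody_de, each of which could then use its own language-specific tagger chain. The chain name and the whitelist are my choices for illustration:

```xml
<!-- Sketch: route "albody" to a per-language field via the langid contrib -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">albody</str>
    <str name="langid.langField">language</str>
    <bool name="langid.map">true</bool>
    <str name="langid.whitelist">en,de</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```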
Second try is to use OpenNLP and apply the patch from
https://issues.apache.org/jira/browse/LUCENE-2899,
but the patch seems to be a bit out of date.
I am currently trying to get it to work with Solr 4.1.
Any pointers appreciated :-)
Regards,
Kai Gülzau