On 14-02-23 08:35 AM, Martin Wunderlich wrote:
Hi all,
I recently started working with OpenNLP for a project in the area of
text classification with neural networks. So far, OpenNLP is a great
library and very useful.
There are just three things that I haven't been able to find, but
maybe they do exist:
- language models: e.g. to create a bigram language model with
relative and absolute frequencies from several texts
- stemming: to reduce different word forms in inflected languages to a
canonical root form
- stoplist: to remove certain words (e.g. from the language model)
that are deemed irrelevant
Do these functions exist in OpenNLP? If not, can you recommend another
library to complement these functions?
Lucene's analyzers-common [1] has stemming algorithms and stoplists for
many languages (for examples, see [2] and [3]). It might be a good
starting point.
Hope this helps,
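For the bigram language model in the question, here is a minimal sketch that uses only the Java standard library, no OpenNLP or Lucene calls. The class name BigramModel, its method names, the regex tokenizer, and the tiny stoplist are all assumptions made for illustration. In practice the stoplist and stemming would more likely come from a library such as the analyzers above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not part of OpenNLP or Lucene.
final class BigramModel {
    // Absolute frequencies, keyed by "first second".
    private final Map<String, Integer> counts = new LinkedHashMap<>();
    // Total number of bigrams seen across all added texts.
    private int total = 0;

    // Naive tokenizer (an assumption): lowercase, split on non-letters,
    // then drop every token found in the stoplist.
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Adds one text. Stop words are removed before pairing, so a bigram
    // may join two words that were separated by a stop word.
    void addText(String text, Set<String> stopWords) {
        List<String> tokens = tokenize(text, stopWords);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            total++;
        }
    }

    int absoluteFrequency(String first, String second) {
        return counts.getOrDefault(first + " " + second, 0);
    }

    // Relative frequency = bigram count / total number of bigrams.
    double relativeFrequency(String first, String second) {
        return total == 0 ? 0.0 : (double) absoluteFrequency(first, second) / total;
    }

    int totalBigrams() {
        return total;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "on", "a");
        BigramModel model = new BigramModel();
        model.addText("The cat sat on the mat.", stop);
        model.addText("A cat sat.", stop);
        model.addText("The cat ran.", stop);
        System.out.println(model.absoluteFrequency("cat", "sat"));
        System.out.println(model.relativeFrequency("cat", "sat"));
    }
}
```

Whether to filter stop words before or after building the bigrams is a design choice. Filtering first, as done here, keeps the model small but creates bigrams between words that were not adjacent in the original text.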
Alexandre
[1] http://lucene.apache.org/core/4_6_1/analyzers-common/index.html
[2] http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/en/EnglishAnalyzer.html
[3] http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/fr/FrenchAnalyzer.html
Kind regards,
Martin