Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/MultiLingualSupport

------------------------------------------------------------------------------
  I've attached my nutch-site.xml to show the complete plugin configuration
  attachment:nutch-site.xml

+ The use of stemming needs to be carefully evaluated for a given corpus and target audience. There are a few aspects to consider:
+ 
+  * Stemming is language-specific. If the corpus is mixed-language, then even if we detect the language of each document and apply the proper stemmer, we still face the problem of correctly identifying the language of the query (and applying the correct stemmer to the query - see above).
+ 
+  * Stemming is likely to increase recall, but it is also likely to reduce precision. It is more useful for morphologically rich languages, and for smaller collections, where users prefer to receive some results, even if they are less precise, over receiving no results whatsoever (thus trading precision for recall). For larger corpora consisting mostly of monolingual documents, stemming usually doesn't improve the quality of results.
+ 
+  * There is usually more than one stemmer implementation for a given language, each giving different precision/recall results for a given corpus. Sometimes an aggressive, iterative stemmer (such as the Porter stemmer) may give worse results than a light custom stemmer that only conflates singular/plural forms.
+ 
  == References ==