[Nutch Wiki] Update of "MultiLingualSupport" by AndrzejBialecki

Apache Wiki Sat, 03 Feb 2007 12:36:24 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/MultiLingualSupport

------------------------------------------------------------------------------
  I've attached my nutch-site.xml to show the complete plugin configuration
  attachment:nutch-site.xml
  
+ The use of stemming needs to be carefully evaluated for a given corpus and 
target audience. There are a few aspects to be considered:
+ 
+  * stemming is language-specific. In a mixed-language corpus, even if we 
detect the language of each document and apply the proper stemmer, we still 
face the problem of correctly identifying the language of the query (and of 
applying the correct stemmer to the query - see above).
+ 
+  * stemming is likely to increase recall, but it is also likely to reduce 
precision. It's more useful for morphologically rich languages, and it's also 
more useful for smaller collections (where users prefer to receive any results, 
even if they are less precise, over receiving no results whatsoever - thus 
trading precision for recall). For larger corpora consisting of mostly 
monolingual documents stemming usually doesn't improve the quality of results.
+ 
+  * there is usually more than one stemmer implementation for a given 
language, each giving different results in terms of precision/recall for a 
given corpus. Sometimes an aggressive, iterative stemmer (such as the Porter 
stemmer) may give worse results than a light custom stemmer that only conflates 
singular/plural forms.
+ 
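To make the last point concrete, here is a minimal sketch of what a "light" 
stemmer that only conflates singular/plural forms might look like for English. 
It is an illustration only, not part of Nutch or of any plugin: the function 
name light_stem and the suffix rules are assumptions chosen for brevity, and 
they will mis-handle some irregular words.

```python
def light_stem(token: str) -> str:
    """Hypothetical light stemmer: strips simple English plural endings only."""
    word = token.lower()
    # "queries" -> "query"
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    # "boxes" -> "box", "classes" -> "class", "searches" -> "search"
    if len(word) > 4 and word.endswith(("sses", "xes", "zes", "ches", "shes")):
        return word[:-2]
    # "stemmers" -> "stemmer"; leave "class", "corpus", "analysis" untouched
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    # Everything else (verb forms, derivations) is returned unchanged
    return word


if __name__ == "__main__":
    for w in ["Queries", "stemmers", "boxes", "corpus", "stemming"]:
        print(w, "->", light_stem(w))
```

Unlike the Porter stemmer, which would reduce "stemming" to "stem", this 
sketch leaves such forms unchanged, so it conflates fewer terms: less gain in 
recall, but also less risk to precision. Whichever stemmer is chosen, the same 
function has to be applied to both the indexed text and the query terms.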
  
  == References ==
