[Nutch-general] wrong query when using token expansion

viz Mon, 23 Jul 2007 12:02:34 -0700

Hello,

I'm working with morphologically rich languages that also make heavy use of
accents. Stemming and accent (in)sensitivity therefore matter a lot: they
strongly affect both recall and precision.


My approach (simplified here): I am using Nutch 0.9. The Analyzer/Tokenizer can
return two tokens for each input word:
 -the original word
 -a token that is the unaccented stem of the original (with its
PositionIncrement set to 0!)

An artificial example:
füümöö -> füümöö, fuum
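
A minimal Python sketch of this expansion, for illustration only. It is not the
Nutch 0.9 analyzer: Token, strip_accents, toy_stem and expand are hypothetical
names, and toy_stem is a stand-in that merely drops a trailing "oo" so that the
artificial example above works.

```python
import unicodedata
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    position_increment: int  # 1 = next position, 0 = same position as the previous token


def strip_accents(word: str) -> str:
    # Decompose to NFD and drop the combining marks (e.g. "ü" -> "u").
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(word: str) -> str:
    # Hypothetical stand-in for a real stemmer: strip a trailing "oo".
    return word[:-2] if len(word) > 2 and word.endswith("oo") else word


def expand(words: List[str]) -> List[Token]:
    tokens: List[Token] = []
    for word in words:
        # The original word advances the position by one.
        tokens.append(Token(word, 1))
        stem = toy_stem(strip_accents(word))
        if stem != word:
            # The unaccented stem is stacked on the same position.
            tokens.append(Token(stem, 0))
    return tokens


if __name__ == "__main__":
    for token in expand(["füümöö"]):
        print(token.text, token.position_increment)
```

The second token is emitted only when it differs from the original word, which
is one possible reading of "can return two tokens".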

At the cost of this index expansion, I expect to gain high recall (thanks to
the stem) while still allowing high precision (when searching for an exact
match on the original word, e.g. by using quotes).

However, if I execute the query 'füümöö' (without quotes), the query parser in
NutchAnalysis generates a boolean query like:
   boolean query:+(url:"füümöö fuum"^4.0) .....

which, of course, returns no hits!
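
The reason for the empty result can be sketched with a small Python model. It
assumes the usual phrase semantics (term i must occur at position start + i) and
is not Nutch or Lucene code; all function names are hypothetical. In the index,
both tokens share one position, so a phrase that expects them at consecutive
positions cannot match, whereas treating same-position tokens as alternatives
for a single slot does.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, int]  # (text, position_increment)


def index_positions(tokens: List[Token]) -> Dict[str, Set[int]]:
    # Assign absolute positions by accumulating the position increments.
    positions: Dict[str, Set[int]] = {}
    pos = -1
    for text, increment in tokens:
        pos += increment
        positions.setdefault(text, set()).add(pos)
    return positions


def phrase_matches(positions: Dict[str, Set[int]], terms: List[str]) -> bool:
    # Exact phrase: term i must occur at position start + i.
    if not terms:
        return False
    return any(
        all(start + i in positions.get(term, set()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], set())
    )


def group_by_position(tokens: List[Token]) -> List[List[str]]:
    # Tokens with increment 0 join the slot of the preceding token.
    groups: List[List[str]] = []
    for text, increment in tokens:
        if increment == 0 and groups:
            groups[-1].append(text)
        else:
            groups.append([text])
    return groups


def slots_match(positions: Dict[str, Set[int]], groups: List[List[str]]) -> bool:
    # Position-aware variant: each slot accepts any one of its alternatives.
    if not groups:
        return False

    def slot_at(group: List[str], pos: int) -> bool:
        return any(pos in positions.get(term, set()) for term in group)

    starts = set().union(*(positions.get(term, set()) for term in groups[0]))
    return any(
        all(slot_at(group, start + i) for i, group in enumerate(groups))
        for start in starts
    )


if __name__ == "__main__":
    analyzed = [("füümöö", 1), ("fuum", 0)]
    indexed = index_positions(analyzed)
    print(phrase_matches(indexed, ["füümöö", "fuum"]))
    print(slots_match(indexed, group_by_position(analyzed)))
```

Here the same analyzed token stream stands in for both the indexed document and
the analyzed query text.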

Why does the parser do this, ignoring the position increment and creating a
*phrase* instead? Is this intended behavior? If so, what is the recommended
approach in Nutch? I assume my use case is common to most languages.

Thanks,

Viktor
-- 
View this message in context: 
http://www.nabble.com/wrong-query-when-using-token-expansion-tf4131766.html#a11750644
Sent from the Nutch - User mailing list archive at Nabble.com.


