Hello,

I'm working with morphologically rich languages that also make heavy use of accents, so stemming and accent (in)sensitivity matter a lot: both strongly affect recall and precision.
My approach (simplified here), using Nutch 0.9: the Analyzer/Tokenizer can return two tokens for each input word:

- the original word
- a token which is an unaccented stem of the original (with its PositionIncrement set to 0!)

An artificial example: füümöö -> füümöö, fuum

At the cost of this index expansion I expect to gain high recall (thanks to the stem) while still allowing high precision (when searching for an exact match with the original word, e.g. by using quotes).

However, if I execute the query 'füümöö' (without quotes), the query parser in NutchAnalysis generates a boolean query like:

boolean query: +(url:"füümöö fuum"^4.0) .....

which, of course, returns no hits!

Why is the parser doing this, ignoring the position increment and creating a *phrase* instead? Is this intended behavior? If so, what is the recommended way to go in Nutch? I assume my use case is common for most languages.

Thanks,
Viktor
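To make the expected behavior concrete, here is a minimal sketch in plain Java with no Nutch or Lucene dependency. The names `Tok`, `PositionAwareQuerySketch`, `naivePhrase`, `groupByPosition` and `render` are hypothetical and only model the idea: tokens whose position increment is 0 share a position with the previous token, so they should become alternatives (OR) rather than consecutive phrase terms. The rendered strings are illustrative and not guaranteed to be valid Nutch or Lucene query syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an analyzer token: term text plus position increment.
final class Tok {
    final String term;
    final int posInc;

    Tok(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }
}

final class PositionAwareQuerySketch {

    // Models the behavior described in the post: every token becomes a
    // consecutive phrase term and the position increment is ignored.
    static String naivePhrase(String field, List<Tok> tokens) {
        List<String> terms = new ArrayList<>();
        for (Tok t : tokens) {
            terms.add(t.term);
        }
        return field + ":\"" + String.join(" ", terms) + "\"";
    }

    // Groups tokens by position: a token with posInc == 0 is stacked on the
    // position of the previous token instead of opening a new one.
    static List<List<String>> groupByPosition(List<Tok> tokens) {
        List<List<String>> positions = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.posInc == 0 && !positions.isEmpty()) {
                positions.get(positions.size() - 1).add(t.term);
            } else {
                List<String> slot = new ArrayList<>();
                slot.add(t.term);
                positions.add(slot);
            }
        }
        return positions;
    }

    // Renders stacked tokens as OR alternatives. Only input that spans more
    // than one position is rendered as a phrase.
    static String render(String field, List<Tok> tokens) {
        List<String> parts = new ArrayList<>();
        for (List<String> slot : groupByPosition(tokens)) {
            parts.add(slot.size() == 1
                    ? slot.get(0)
                    : "(" + String.join(" OR ", slot) + ")");
        }
        if (parts.size() == 1) {
            return field + ":" + parts.get(0);
        }
        return field + ":\"" + String.join(" ", parts) + "\"";
    }

    public static void main(String[] args) {
        // The example word from the post, written with Unicode escapes so the
        // source compiles regardless of the file encoding.
        List<Tok> tokens = Arrays.asList(
                new Tok("f\u00fc\u00fcm\u00f6\u00f6", 1), // original word
                new Tok("fuum", 0));                      // unaccented stem, same position
        System.out.println(naivePhrase("url", tokens));
        System.out.println(render("url", tokens));
    }
}
```

With the two tokens from the example, `naivePhrase` reproduces the phrase form reported above, while `render` yields a single clause that matches either the original word or its stem. For multi-word input, the grouped form corresponds roughly to what Lucene's `MultiPhraseQuery` expresses (several alternative terms per position); whether the Nutch 0.9 query parser can be made to produce that is exactly the open question.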
