Pathiyil, Praveen wrote:
How do we query for words like "C++" or "[EMAIL PROTECTED]" in Nutch? I tried to modify NutchAnalysis.jj so that a quoted string is not stripped of characters like +, @ and .

With this change, the <query object>.toString() gives me +((+url:"c++"^4.0) (+anchor:"c++"^2.0) (+content:"c++"))

However this does not give me any hits, though there are documents with the word C++ in my index.

You also need to tokenize the documents correctly when indexing. Currently, a plus (+) is never included in a token.


One could change this generally, e.g., to include plus in words anywhere but in the first character, but I think that's probably not a great idea, since folks might write things like "man+woman=child", which should probably not be a single token.

One could instead only permit plusses at the end of tokens. This appears to be what Google does. (Evidence: "man+woman" matches "man woman", "c+" matches "c+-", "c++" matches "c++", and "man++" matches "man++").

Or one could make a special case for "c++". This looks like what Yahoo! does. (Evidence: "man+woman" matches "man woman", "c+" matches "c", "c++" matches "c++", and "man++" matches "man").

I prefer Yahoo!'s approach. "C++" is not normal natural language punctuation, any more than "Yahoo!" is. A page containing "man++" should probably be returned when someone searches for "man".
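To make the special-case behaviour concrete, here is a minimal standalone sketch in plain Java. It is not the Nutch tokenizer: the class name IrregularWordSketch is hypothetical, and ordinary words are simplified to runs of letters and digits (the real grammar also allows some word punctuation). It only illustrates the rule that "c++" and "c#" survive as tokens while a plus anywhere else is dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Nutch.
class IrregularWordSketch {
    // The irregular words are tried first at each position; otherwise a token
    // is a plain run of letters and digits (a simplifying assumption).
    private static final Pattern TOKEN =
        Pattern.compile("[Cc](?:\\+\\+|#)|[\\p{L}\\p{N}]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Lowercase every token, as the WORD rule does.
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"man+woman", "c+", "C++", "man++", "C#"}) {
            System.out.println(s + " -> " + tokenize(s));
        }
    }
}
```

With this rule "man+woman" yields "man" and "woman", "c+" yields "c", "C++" yields "c++", and "man++" yields "man", which is the Yahoo!-style behaviour described above.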

I've attached a patch that implements this.  Does it work for you?

As for "[EMAIL PROTECTED]", this is currently tokenized as "xyz", "abc", and "net", and a query containing it is automatically converted into a phrase search. So it would also match documents containing "[EMAIL PROTECTED]", but that's not really a problem.
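The phrase-search behaviour can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: PhraseMatchSketch is a hypothetical class, the tokenizer is simplified to runs of letters and digits, and the input strings are made-up examples chosen to produce the tokens "xyz", "abc", "net". A query is treated as a phrase, so it matches any document in which the same tokens appear consecutively and in order, whatever punctuation separated them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Nutch.
class PhraseMatchSketch {
    // Simplifying assumption: a token is a run of letters and digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // True if the query's tokens occur consecutively, in order, in the document.
    static boolean phraseMatches(String document, String query) {
        List<String> phrase = tokenize(query);
        return !phrase.isEmpty()
            && Collections.indexOfSubList(tokenize(document), phrase) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("xyz@abc.net"));
        System.out.println(phraseMatches("write to xyz.abc@net today", "xyz@abc.net"));
        System.out.println(phraseMatches("net abc xyz", "xyz@abc.net"));
    }
}
```

The second call shows the over-matching mentioned above: a differently punctuated string with the same three tokens still matches, while the same tokens in a different order do not.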

Doug
Index: src/java/net/nutch/analysis/NutchAnalysis.jj
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/analysis/NutchAnalysis.jj,v
retrieving revision 1.8
diff -u -r1.8 NutchAnalysis.jj
--- src/java/net/nutch/analysis/NutchAnalysis.jj	9 Jul 2004 21:27:00 -0000	1.8
+++ src/java/net/nutch/analysis/NutchAnalysis.jj	1 Sep 2004 17:09:29 -0000
@@ -79,7 +79,7 @@
 TOKEN : {					  // token regular expressions
 
   // basic word -- lowercase it
-<WORD: (<LETTER>|<DIGIT>|<WORD_PUNCT>)+>
+<WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
   { matchedToken.image = matchedToken.image.toLowerCase(); }
 
   // special handling for acronyms: U.S.A., I.B.M., etc: dots are removed
@@ -95,6 +95,11 @@
   // chinese, japanese and korean characters
 | <SIGRAM: <CJK> >
 
+   // irregular words
+| <#IRREGULAR_WORD: (<C_PLUS_PLUS>|<C_SHARP>)>
+| <#C_PLUS_PLUS: ("C"|"c") "++" >
+| <#C_SHARP: ("C"|"c") "#" >
+
   // query syntax characters
 | <PLUS: "+" >
 | <MINUS: "-" >
