Hi Doug,

Thanks a lot for the elaborate explanation. At this point, I would like to 
take the second approach and accept optional pluses at the end of a <WORD> in 
addition to the original ones. I made that change and it gave me the correct 
results.

Regards,
Praveen.

Doug Cutting <[EMAIL PROTECTED]> said:

> Pathiyil, Praveen wrote:
> > How do we query for words like "C++" or "[EMAIL PROTECTED]" in Nutch? I tried to
> > modify NutchAnalysis.jj so that when we get a quoted string, it is not
> > stripped of characters like +, @ and .
> > 
> > With this change, the <query object>.toString() gives me 
> >  +((+url:"c++"^4.0) (+anchor:"c++"^2.0) (+content:"c++"))
> > 
> > However this does not give me any hits, though there are documents with the
> > word C++ in my index.
> 
> You also need to tokenize the documents correctly when indexing. 
> Currently, a plus (+) is never included in a token.
> 
> One could change this generally, e.g., to include plus in words anywhere 
> but in the first character, but I think that's probably not a great 
> idea, since folks might write things like "man+woman=child", which 
> should probably not be a single token.
> 
> One could instead only permit plusses at the end of tokens.  This 
> appears to be what Google does.  (Evidence: "man+woman" matches "man 
> woman", "c+" matches "c+-", "c++" matches "c++", and "man++" matches 
> "man++").
> 
> Or one could make a special case for "c++".  This looks like what Yahoo! 
> does.  (Evidence: "man+woman" matches "man woman", "c+" matches "c", 
> "c++" matches "c++", and "man++" matches "man").
> 
> I prefer Yahoo!'s approach.  "C++" is not normal natural language 
> punctuation, any more than "Yahoo!" is.  A page containing "man++" should 
> probably be returned when someone searches for "man".
> 
> I've attached a patch that implements this.  Does it work for you?
> 
> As for "[EMAIL PROTECTED]", this is currently tokenized as "xyz", "abc", 
> "net", and a query containing it is automatically converted into a phrase 
> search.  So this would also match documents with "[EMAIL PROTECTED]", but 
> that's not really a problem.
> 
> Doug
> 
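
For readers comparing the two tokenization rules discussed above, here is a minimal sketch in plain Java. It is an illustration under stated assumptions, not the NutchAnalysis.jj grammar and not the attached patch: the class name PlusTokenizerSketch, the two regular expressions, and the restriction to ASCII letters and digits are all hypothetical. Rule 1 keeps a run of pluses only when it ends a word; rule 2 treats "c++" as the single exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is not the NutchAnalysis.jj grammar. */
final class PlusTokenizerSketch {

    // Rule 1 (assumed form): pluses are kept only at the end of a word, i.e.
    // when the run of pluses is not followed by a letter, digit, or plus.
    static final Pattern TRAILING_PLUS =
        Pattern.compile("[a-z0-9]+(?:\\++(?![a-z0-9+]))?");

    // Rule 2 (assumed form): "c++" is the only token allowed to contain pluses.
    static final Pattern SPECIAL_CASE =
        Pattern.compile("c\\+\\+|[a-z0-9]+");

    /** Lower-cases the text and returns every match of the given rule, in order. */
    static List<String> tokenize(Pattern rule, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = rule.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With the input "man+woman c+- C++ man++", tokenize(TRAILING_PLUS, ...) yields [man, woman, c+, c++, man++], while tokenize(SPECIAL_CASE, ...) yields [man, woman, c, c++, man]. Whichever rule is chosen, the same one has to be applied both when indexing documents and when parsing queries, otherwise a query token such as "c++" has nothing to match in the index.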



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
