Hi Doug,

Thanks a lot for the elaborate explanation. At this point, I would like to take the second approach and accept optional pluses at the end of a <WORD> in addition to the original ones. I made that change and it gave me the correct results.
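
For anyone following the thread, here is a rough sketch of the trailing-plus rule in plain Java. It is not the actual NutchAnalysis.jj change: the class name TrailingPlusTokenizer, the ASCII-only word characters, and the regular expression are all assumptions made for illustration. It treats a run of pluses as part of a token only when nothing word-like follows it, which is consistent with the Google evidence quoted below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Nutch.
class TrailingPlusTokenizer {

    // A run of ASCII letters/digits (an assumption; the real <WORD> is broader),
    // optionally followed by a run of pluses. The negative lookahead rejects the
    // pluses if a letter, digit, or further plus comes next, so "man+woman"
    // still splits into two tokens.
    private static final Pattern TOKEN =
        Pattern.compile("[a-z0-9]+(?:\\++(?![a-z0-9+]))?");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C++"));             // [c++]
        System.out.println(tokenize("man+woman=child")); // [man, woman, child]
        System.out.println(tokenize("c+-"));             // [c+]
        System.out.println(tokenize("man++"));           // [man++]
    }
}
```

The same rule has to be applied both when indexing and when parsing the query; otherwise the query term "c++" would have no matching token in the index.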

Regards,
Praveen.

Doug Cutting <[EMAIL PROTECTED]> said:

> Pathiyil, Praveen wrote:
> > How do we query for words like "C++" or "[EMAIL PROTECTED]" in Nutch ? I tried to
> > modify NutchAnalysis.jj so that when we get a quoted string, it is not
> > stripped of characters like +, @ and .
> >
> > With this change, the <query object>.toString() gives me
> > +((+url:"c++"^4.0) (+anchor:"c++"^2.0) (+content:"c++"))
> >
> > However this does not give me any hits, though there are documents with the
> > word C++ in my index.
>
> You also need to tokenize the documents correctly when indexing.
> Currently, a plus (+) is never included in a token.
>
> One could change this generally, e.g., to include plus in words anywhere
> but in the first character, but I think that's probably not a great
> idea, since folks might write things like "man+woman=child", which
> should probably not be a single token.
>
> One could instead only permit plusses at the end of tokens. This
> appears to be what Google does. (Evidence: "man+woman" matches "man
> woman", "c+" matches "c+-", "c++" matches "c++", and "man++" matches
> "man++").
>
> Or one could make a special case for "c++". This looks like what Yahoo!
> does. (Evidence: "man+woman" matches "man woman", "c+" matches "c",
> "c++" matches "c++", and "man++" matches "man").
>
> I prefer Yahoo!'s approach. "C++" is not normal natural language
> punctuation, anymore than "Yahoo!" is. A page containing "man++" should
> probably be returned when someone searches for "man".
>
> I've attached a patch that implements this. Does it work for you?
>
> As for "[EMAIL PROTECTED]", this is currently tokenized as "xyz", "abc",
> "net", and query with this is automatically converted into a phrase
> search. So this would also match documents with "[EMAIL PROTECTED]", but
> that's not really a problem.
>
> Doug

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
