I was wondering why the query syntax is so limited. There are no OR queries, there are no fielded queries, or fuzzy, or approximate... Why?
The idea was to start simple, with the basics, and also stick to things which are efficient. Fuzzy searches are very expensive. I know of no web search engine which permits these. Google only added OR queries recently. Nutch starts with +, -, and phrases. That probably accounts for 99% of web searches. But adding OR, site:, domain:, link:, format: etc. would be useful.
The underlying index supports all these operations. It would sometimes make a lot of sense to ask a query like "site:www.nutch.org" instead of "www nutch org" and hoping that the scoring algorithm gets it right...
I hope to make the query translation layer more extensible through the use of colon-terminated "fields", as you indicate. It's hard to make the parser itself extensible, so my idea is to have a single, slightly more general query syntax, and then permit folks to plug in alternate translators from Nutch queries to Lucene queries. The Nutch query data structure already stores a field per term, but the query parser doesn't yet use it. The query parser would parse these fields and pass them through to the translator, and plugins could then process them. By default, unrecognized fields would result in errors. A "site" field handler would probably be one of the first to implement. But none of this has yet been implemented.
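As a rough illustration of that plan, here is a minimal sketch in Python. It is not Nutch code, since none of this exists yet. It covers a single general syntax (+, -, phrases, and an optional "field:" prefix), a parser that records one field per term, and a translator that dispatches each clause to a pluggable per-field handler and rejects unrecognized fields. Every name in it (Clause, Translator, default_handler, site_handler, and the "content" and "site" index field names) is hypothetical, and the output is only a Lucene-style query string, not a real Lucene query object.

```python
import re
from dataclasses import dataclass

DEFAULT_FIELD = "DEFAULT"  # hypothetical marker for terms with no "field:" prefix

@dataclass
class Clause:
    """One parsed term or phrase, carrying the field it was written under."""
    field: str
    text: str
    prohibited: bool = False
    phrase: bool = False

# Optional +/- sign, optional "field:" prefix, then a quoted phrase or a bare term.
_TOKEN = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]*)"|(\S+))')

def parse(query):
    """Parse the general syntax into clauses without interpreting any field."""
    clauses = []
    for match in _TOKEN.finditer(query):
        sign, field, phrase, term = match.groups()
        clauses.append(Clause(
            field=field or DEFAULT_FIELD,
            text=phrase if phrase is not None else term,
            prohibited=(sign == "-"),
            phrase=phrase is not None,
        ))
    return clauses

class UnknownFieldError(ValueError):
    """Raised when a query uses a field that no handler is registered for."""

class Translator:
    """Maps parsed clauses to a Lucene-style string via per-field handlers."""

    def __init__(self):
        self._handlers = {}

    def register(self, field, handler):
        # A "plugin" here is just a function from Clause to a query fragment.
        self._handlers[field] = handler

    def translate(self, query):
        parts = []
        for clause in parse(query):
            handler = self._handlers.get(clause.field)
            if handler is None:
                # Default behaviour: unrecognized fields are errors.
                raise UnknownFieldError("unrecognized field: " + clause.field)
            sign = "-" if clause.prohibited else "+"
            parts.append(sign + handler(clause))
        return " ".join(parts)

def default_handler(clause):
    # Assumes page text is indexed under a field named "content".
    text = '"%s"' % clause.text if clause.phrase else clause.text
    return "content:" + text

def site_handler(clause):
    # Assumes the host name is indexed under a field named "site".
    return "site:" + clause.text

translator = Translator()
translator.register(DEFAULT_FIELD, default_handler)
translator.register("site", site_handler)
print(translator.translate('search -spam "open source" site:www.nutch.org'))
```

With only these two handlers registered, a query such as "link:foo" raises UnknownFieldError. Supporting it would mean registering one more handler, with no change to the parser.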
Again, if you'd like to work on this, we'd welcome your contributions.
Doug
