Hello,
Allen Atamer [mailto:[EMAIL PROTECTED]
The Javadoc spec calls for one or more clauses in a query,
but I had trouble with a NOT query just on its own. For example
Most search engines including Lucene doesn't support this query type.
That's why the query parser treats this queries as invalid (if
it recognizes them).
QueryParser.parse(my_field:-exclude) throws a parsing exception
Same with
QueryParser.parse(my_field:-(exclude))
QueryParser.parse(my_field:(* AND -exclude)
These queries are defined as invalid.
The query QueryParser.parse(my_field:(-(exclude))) gives a
legitimate query that brings no results.
This is the result returned by many search engines. They select a document
set by searching for the query terms first and get the documents which
contains
them from index. Because your query doesn't contains any term which should
be part of a document, your result is empty. Note that they check the
excluded words only for documents selected during the first step to save
time and memory.
What I would expect is the following: If I have an index with
100 total entries, and 20 records with the word exclude in
them, then the above queries should give 80 hits. There is no
test case for this scenario in TestQueryParser. Please
confirm whether this is a bug or not,
This is no error. You might think, that it should be easy to
get all documents and discard all of them which contains the
exluded word.
But imaging your index contains about 1 million documents, the search
requires a very long time and the result will be very huge (and so
mostly useless). Many Engines avoid this trouble and return an empty
result or report an error.
You can workaround this limitation by adding a dummy field and
term to each document while creating your index, e.g.
document.add(Field.Text(dummy_field, dummy_value));
You have to add dummy_field:dummy_value to your query string, e.g.:
dummy_field:dummy_value my_field:-(exclude) will returns 80 hits
in your scenario (after rebuild your index).
Warning: the performance will be very poor on larger indexes and
users can run denial of service attacks by sending such queries.
Note that a single * is not allowed too, because performance would be
very poor if expression starts with a wildcard. In this case Lucene
would run of memory too, because the internal result contains all
words/terms stored in your index.
Regards,
Wolf-Dietrich Materna
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]