On Oct 13, 2005, at 7:36 AM, Mikko Noromaa wrote:
Hi,


It would be possible to do a PatternQuery("*") that would
enumerate every term.


Does this work differently than the current logic where wildcard queries are constructed as BooleanQueries with many terms OR'ed together? I think this
would be a good change.

No - it works identically to WildcardQuery, with the only difference being how it matches. The added bonus though is that there is a SpanPatternQuery to go along with this, allowing for "foo bar*" phrase queries.
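To make the expansion concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the index's terms are available as a simple list; the class and method names (PatternExpansionSketch, expand, toOrQuery) are hypothetical and are not part of Lucene or of my PatternQuery code. It only illustrates the idea: walk the terms, keep the ones the pattern matches, and OR them together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PatternExpansionSketch {

    /** Returns every term from the (hypothetical) term list that the regex matches in full. */
    static List<String> expand(List<String> indexTerms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String term : indexTerms) {
            // Every term is visited; only the matching step differs from a wildcard match.
            if (pattern.matcher(term).matches()) {
                matched.add(term);
            }
        }
        return matched;
    }

    /** Renders the matched terms as the OR'ed clause list they would be rewritten into. */
    static String toOrQuery(List<String> matched) {
        return String.join(" OR ", matched);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("bar", "barn", "baz", "foo");
        System.out.println(toOrQuery(expand(terms, "bar.*"))); // prints "bar OR barn"
    }
}
```

A pattern such as ".*" would match every term in the list, which is the enumeration case mentioned at the top of this message.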

I have always thought that it is quite cumbersome to expand wildcards to many boolean clauses. I think that keeping the wildcard (or regex in this case) in the query object would be much better. On the other hand, it might not make any difference in performance, since Lucene would still have to go
through all the terms. But at least it would avoid the
BooleanQuery$TooManyClauses exception even with thousands of different
terms. Right?

At this point the exception can still occur, so you have to increase the maximum number of clauses to avoid it.

I know I can increase the limit of the boolean queries, but there is still a limit. In my application, I index Finnish text which has lots of different suffixes for the same word. With compound words included, I could easily imagine that the same base word may have hundreds or thousands of terms in
the index.

Hundreds is still under the built-in limit of 1024 clauses for BooleanQuery. Thousands is doable by increasing the limit, provided you have sufficient RAM.
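As a rough illustration of how the limit behaves, here is a self-contained simulation in plain Java. It does not use Lucene: ClauseLimitSketch, its TooManyClauses exception, and the synthetic "talo" terms are all made up for the example. In Lucene itself the limit is adjusted through the static BooleanQuery.setMaxClauseCount(int) setting.

```java
import java.util.ArrayList;
import java.util.List;

public class ClauseLimitSketch {

    /** Simulated clause limit; 1024 is the default mentioned above. */
    static int maxClauseCount = 1024;

    /** Stand-in for BooleanQuery's TooManyClauses exception (hypothetical class). */
    static class TooManyClauses extends RuntimeException {
        TooManyClauses(int limit) {
            super("more than " + limit + " clauses");
        }
    }

    /** Expands a prefix into one clause per matching term, enforcing the limit. */
    static List<String> expandPrefix(List<String> indexTerms, String prefix) {
        List<String> clauses = new ArrayList<>();
        for (String term : indexTerms) {
            if (term.startsWith(prefix)) {
                if (clauses.size() >= maxClauseCount) {
                    throw new TooManyClauses(maxClauseCount);
                }
                clauses.add(term);
            }
        }
        return clauses;
    }

    /** Builds n synthetic terms sharing one base word, e.g. talo0, talo1, ... */
    static List<String> syntheticTerms(String base, int n) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            terms.add(base + i);
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> terms = syntheticTerms("talo", 3000);
        try {
            expandPrefix(terms, "talo");
        } catch (TooManyClauses e) {
            System.out.println("default limit hit: " + e.getMessage());
        }
        maxClauseCount = 4096; // raise the limit; each clause still costs memory
        System.out.println(expandPrefix(terms, "talo").size() + " clauses");
    }
}
```

Raising the limit only moves the ceiling. Each matching term still becomes its own clause, which is where the RAM requirement comes from.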

For suffix-wildcards, there really is no difference between my PatternQuery and WildcardQuery. WildcardQuery may even be faster if its matching is quicker than regex matching, though tests would be needed to confirm that; I'd guess the performance difference is small.

    Erik

