Tom,

Very cool! Thanks for sharing your technique, which works well for prefixed and suffixed wildcard queries. However, it doesn't address an * in the middle of a term, say W*D. Obviously your usage doesn't require better performance for a wildcard in the middle, so you've done well - I just wanted to point out the one caveat for others. A prefixed wildcard is the worst performer, though, so you've nipped the major one.

    Erik

On Oct 7, 2005, at 9:17 AM, Aigner, Thomas wrote:

Thanks Erik, I tried the reverse index and it worked like a charm.
While I was doing this, we figured out a way to handle "contains"
searches and searches with a wildcard at the beginning. I thought I
would share it with the community (and realized it handles the reverse
index case as well).

Word: ABCDEFG

Tokens created:
    <ABCDEFG
    BCDEFG
    CDEFG
    DEFG
    EFG
    FG
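
To make the token list above concrete, here is a minimal sketch in plain Java. It is not Tom's analyzer and does not use the Lucene analysis API; the class SuffixTokens, the method tokens, and the minLen parameter are names made up for this illustration. The cutoff of 2 is an assumption, based only on the list above stopping at FG.

```java
import java.util.ArrayList;
import java.util.List;

final class SuffixTokens {
    // Start-of-word marker, as in the "<ABCDEFG" token above.
    static final char START = '<';

    // Returns the marked full word, then every proper suffix
    // that is at least minLen characters long (minLen is an assumption).
    static List<String> tokens(String word, int minLen) {
        List<String> out = new ArrayList<>();
        out.add(START + word);
        for (int i = 1; i + minLen <= word.length(); i++) {
            out.add(word.substring(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [<ABCDEFG, BCDEFG, CDEFG, DEFG, EFG, FG]
        System.out.println(tokens("ABCDEFG", 2));
    }
}
```

In a real index these tokens would presumably be emitted by a custom analyzer; that wiring is left out here.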

What I do, depending on the search string:
    WORD*   I search for <WORD*
    *WORD   I search for WORD*
    *WORD*  I search for WORD*
    WORD    I search for <WORD
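
The mapping above could be sketched as a small query pre-processing step. This is only an illustration in plain Java; QueryRewrite and rewrite are hypothetical names, and rejecting a wildcard in the middle of the term is my assumption, since the mapping does not cover that case.

```java
final class QueryRewrite {
    // Rewrites a user query into the form searched against the suffix tokens.
    static String rewrite(String query) {
        boolean leading = query.startsWith("*");
        boolean trailing = query.endsWith("*") && query.length() > 1;
        String core = query.substring(leading ? 1 : 0,
                                      query.length() - (trailing ? 1 : 0));
        if (core.isEmpty() || core.contains("*")) {
            // Bare "*" or a wildcard in the middle: not covered by the mapping.
            throw new IllegalArgumentException("unsupported query: " + query);
        }
        if (leading) {
            return core + "*";        // *WORD and *WORD* both become WORD*
        }
        if (trailing) {
            return "<" + core + "*";  // WORD* becomes <WORD*
        }
        return "<" + core;            // WORD becomes <WORD
    }

    public static void main(String[] args) {
        System.out.println(rewrite("*WORD*")); // prints WORD*
    }
}
```

One thing to watch: read literally against the token list, a word that begins with WORD is indexed only as <WORD..., so whether *WORD* also finds it depends on analyzer details not shown in this thread.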

With this technique, search time dropped tremendously for "contains"
searches and for searches with a wildcard at the beginning. The index
has become 5X as large and takes longer to build, but I'm willing to
sacrifice disk space and build time for this huge gain in speed. Also, I
have taken the wildcard query completely out of the program, so
everything now goes through my customized analyzer.

Tom

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 05, 2005 9:27 AM
To: java-user@lucene.apache.org
Subject: Re: Optimization


On Oct 5, 2005, at 9:05 AM, Aigner, Thomas wrote:

    I have a question. Are there any obvious things that can be done
to help speed up query lookups, especially wildcard searches (e.g.
*lamps)?


Obvious?  Sort of.  *lamps needs to scan through _every_ single term
in the index (for the specified field only, of course) because terms
are lexicographically ordered.
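
To illustrate the point about ordering, here is a small sketch that uses an in-memory TreeSet as a stand-in for a sorted term dictionary. This is an assumption for illustration only and not how Lucene stores terms; TermScan, withPrefix, and withSuffix are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class TermScan {
    // Prefix match: sorted order lets us jump straight to one contiguous range.
    // Assumes no term contains the character \uffff.
    static SortedSet<String> withPrefix(TreeSet<String> terms, String prefix) {
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // Suffix match: hits are scattered through the order, so every term is examined.
    static List<String> withSuffix(TreeSet<String> terms, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.endsWith(suffix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
            Arrays.asList("lamp", "lamps", "lampshade", "clamps", "table"));
        System.out.println(withPrefix(terms, "lamp"));  // prints [lamp, lamps, lampshade]
        System.out.println(withSuffix(terms, "lamps")); // prints [clamps, lamps]
    }
}
```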

If you reverse terms during analysis and lay them in the same
position (increment 0) as the original token you'd end up with
"spmal..." terms.  Now pre-process the query string and if there is a
prefixed wildcard query, reverse it so that "*lamps" turns into
"spmal*" and you will likely achieve a dramatic speed-up.

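A minimal sketch of the query pre-processing side of the reversal idea, in plain Java. ReverseWildcard, reverse, and rewrite are hypothetical names, and this covers only the simple "*term" case; how the reversed tokens are produced at index time and kept apart from the forward ones is left out.

```java
final class ReverseWildcard {
    // Reverses a term, e.g. "lamps" becomes "spmal".
    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    // Turns a leading-wildcard query such as "*lamps" into the prefix
    // query "spmal*" to be run against the reversed terms.
    // Any other query is returned unchanged.
    static String rewrite(String query) {
        boolean onlyLeadingStar = query.length() > 1
                && query.startsWith("*")
                && query.indexOf('*', 1) < 0;
        if (onlyLeadingStar) {
            return reverse(query.substring(1)) + "*";
        }
        return query;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("*lamps")); // prints spmal*
    }
}
```
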
This is just one technique for dealing with prefixed wildcard
queries.  There is more fun to be had with queries like *lamps*.  A
technique I learned from the book Managing Gigabytes is to rotate
terms through all their possible variations and index all of those,
which also requires cleverness on the querying side of things.
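
One common reading of "rotating terms" is the permuterm scheme; assuming that is what is meant, a rough sketch follows. The '$' end marker and the names Rotations, rotations, and toPrefix are choices made for this illustration, and only queries with exactly one '*' are handled.

```java
import java.util.ArrayList;
import java.util.List;

final class Rotations {
    // End-of-term marker (an assumption; any character absent from terms works).
    static final char END = '$';

    // All rotations of term + END, e.g. "abc" gives abc$, bc$a, c$ab, $abc.
    static List<String> rotations(String term) {
        String s = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(s.substring(i) + s.substring(0, i));
        }
        return out;
    }

    // Rewrites a single-wildcard query X*Y into the prefix Y$X,
    // which is then looked up among the indexed rotations.
    static String toPrefix(String query) {
        int star = query.indexOf('*');
        if (star < 0 || star != query.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one '*': " + query);
        }
        return query.substring(star + 1) + END + query.substring(0, star);
    }

    public static void main(String[] args) {
        System.out.println(rotations("abc")); // prints [abc$, bc$a, c$ab, $abc]
        System.out.println(toPrefix("W*D"));  // prints D$W
    }
}
```

With this layout a wildcard in the middle of a term (W*D) also becomes a plain prefix lookup, at the cost of indexing one rotation per character of each term.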

     Erik



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

