Re: Starts With x and Ends With x Queries

Luke Shannon Mon, 07 Feb 2005 15:49:26 -0800

I implemented this concept for my ends with query. It works very well!

----- Original Message ----- 
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Friday, February 04, 2005 9:37 PM
Subject: Re: Starts With x and Ends With x Queries



>
> : Also keep in mind that QueryParser only allows a trailing asterisk,
> : creating a PrefixQuery.  However, if you use a WildcardQuery directly,
> : you can use an asterisk as the starting character (at the risk of
> : performance).
>
> On the issue of "ends with" wildcard queries, I wanted to throw out and
> idea that i've seen used to deal with matches like this in other systems.
> I've never acctually tried this with Lucene, but I've seen it used
> effectively with other systems where the goal is to "sort" strings by the
> least significant (ie: right most) characters first.  I think it could
> apply nicely to people who have compelling needs for efficent 'ends with'
> queries.
>
>
>
> Imagine you have a field call name, which you can already do efficient
> prefix matching on using the PrefixQuery class.  Your docs and query may
> look something like this...
>
>    D1> name:"Adam Smith" age:13 state:CA ...
>    D2> name:"Joe Bob" age:42 state:WA ...
>    D3> name:"John Adams" age:35 state:NV ...
>    D3> name:"Sue Smith" age:33 state:CA ...
>
> ...and your queries may look something like...
>
>    Query q1 = new PrefixQuery(new Term("name","J*"));
>    Query q2 = new PrefixQuery(new Term("name","Sue*"));
>
> If you want to start doing suffix queries (ie: all names ending with
> "s", or all names ending with "Smith") one approach would be to use
> WildcarQuery, which as Erik mentioned, will allow you to use a quey Term
> that starts with a "*". ie...
>
>    Query q3 = new WildcardQuery(new Term("name","*s"));
>    Query q4 = new WildcardQuery(new Term("name","*Smith"));
>
> (NOTE: Erik says you can do this, but the docs for WildcardQuery say you
> can't I'll assume the docs are wrong and Erik is correct.)
>
> The problem is that this is horrendously inefficient.  In order to find
> the docs that contain Terms which match your suffix, WildcardQuery must
> first identify what all of those Terms are, by iterating over every Term
> in your index to see if they match the suffix.  This is much slower then a
> PrefixQuery, or even a WildcardQuery that has just 1 initial character
> before a "*" (ie: "s*foobar"), because it can then seek to directly to the
> first Term that starts with that character, and also stop iterating as
> soon as it encounters a Term that no longer begins with that character.
>
> Which leads me to my point: if you denormalize your data so that you store
> both the Term you want, and the *reverse* of the term you want, then a
> Suffix query is just a Prefix query on a reversed field -- by sacrificing
> space, you can get all the speed efficiencies of a PrefixQuery when doing
> a SuffixQuery...
>
>    D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
>    D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
>    D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
>    D3> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...
>
>    Query q1 = new PrefixQuery(new Term("name","J*"));
>    Query q2 = new PrefixQuery(new Term("name","Sue*"));
>    Query q3 = new PrefixQuery(new Term("rname","s*"));
>    Query q4 = new PrefixQuery(new Term("rname","htimS*"));
>
>
> (If anyone sees a flaw in my theory, please chime in)
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Starts With x and Ends With x Queries

Reply via email to