Re: Can use Lucene be used for this

Dror Matalon Thu, 13 Nov 2003 01:40:15 -0800

On Thu, Nov 13, 2003 at 09:22:57AM +0100, Hackl, Rene wrote:
> Hi John,
> 
> Indeed, the RCO index is ok for prefix-style wildcards. But it doesn't work
> for _simultaneous_ left and right truncation ("*oba*"). I have no idea about
> how often this kind of search is actually employed, but in this particular
> context it is really needed (I sketched this before on this list, in brief:
> documents contain very long strings for chemical substances, users are
> interested in certain parts of the string e.g. find all documents that
> comprise "*foo*" be it "1-foo-bar" or "rab-oof-13-foonyl-naphthalene").


If you can tell Lucene at indexing time what the parts of the strings are
(that is, tokenize them into the fragments users will search for), this
should be easy to do. Otherwise, I suspect that Lucene might not be the
right tool for the job (more experienced users might care to confirm).

DDJ did have a recent article that mentioned a solution to a similar
problem:

        "Full-Text Searching & the Burrows-Wheeler Transform" by Kendall Willets

        Here's an indexing method that lets you find any character sequence in
        the source text using a structure that can fit the entire source text
        and index into less space than the text alone.

        http://www.ddj.com/articles/2003/0312/

I haven't looked at it in depth, though.
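
For readers who haven't met the transform, here is a minimal sketch of how a
Burrows-Wheeler-transformed text can answer "does this character sequence
occur, and how often?" by backward search. It is not the code from the DDJ
article: the function names are made up for illustration, the construction
sorts all suffixes naively, and the occurrence counts are recomputed on every
step, so it only shows the idea, not the compact structure the article
describes.

```python
def bwt(text):
    """Return the Burrows-Wheeler transform of text (naive construction)."""
    s = text + "\0"  # sentinel that sorts before every other character
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The transform is the character preceding each sorted suffix.
    return "".join(s[i - 1] for i in order)


def count_occurrences(last, pattern):
    """Count occurrences of pattern in the original text, given its BWT."""
    first = sorted(last)
    # start[c] = position of the first suffix that begins with character c
    start = {}
    for pos, c in enumerate(first):
        start.setdefault(c, pos)

    lo, hi = 0, len(last)  # current range of matching suffixes
    for c in reversed(pattern):
        if c not in start:
            return 0
        # Hypothetical slow rank: count c in a prefix of the transform.
        lo = start[c] + last[:lo].count(c)
        hi = start[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("1-foo-bar rab-oof-13-foonyl-naphthalene")
foo_count = count_occurrences(transformed, "foo")
oba_count = count_occurrences(transformed, "oba")
```

Because the search works on arbitrary character sequences, a query like
"*foo*" needs no tokenization at all. A real implementation would replace the
`count` calls with precomputed rank tables.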

Regards,

Dror

> 
> Suggestions on improvements are always welcome! :-)
> 
> Best regards,
> René
> 
> 
> -----Original Message-----
> From: Majerus, John P. [mailto:[EMAIL PROTECTED]
> Sent: Thursday, November 13, 2003 00:41
> To: 'Lucene Users List'
> Subject: RE: Can use Lucene be used for this
> 
> Hello,
> This has probably been put forth on the list before, but how about the
> following approach for leftmost wildcard searches, at least for single term
> searches?
> 
> Reverse the character order of all words after they're stemmed and before
> they're added to a special reverse-character-order (RCO) index. Any time a
> wildcard is found at the beginning of the search term, the special index
> is engaged. A search for "*bar" is then converted to a search for "rab*"
> on the RCO index, which finds "raboof"; the result is reversed again for
> display to yield "foobar".
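
A rough sketch of the reverse-character-order idea, outside Lucene, assuming
the index is nothing more than a sorted list of reversed terms. The helper
names are hypothetical and stemming is left out; the point is only that a
leading wildcard becomes an ordinary prefix lookup once the terms are
reversed.

```python
import bisect


def build_rco_index(terms):
    """Return the sorted list of reversed terms (the 'RCO index')."""
    return sorted(term[::-1] for term in set(terms))


def prefix_search(sorted_terms, prefix):
    """Return every entry of a sorted list that starts with prefix."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hits = []
    for term in sorted_terms[lo:]:
        if not term.startswith(prefix):
            break  # sorted order: no later entry can match
        hits.append(term)
    return hits


def leading_wildcard_search(rco_index, query):
    """Answer a '*bar'-style query by searching 'rab*' on the RCO index."""
    if not query.startswith("*"):
        raise ValueError("expected a query with a leading wildcard")
    reversed_prefix = query[1:][::-1]
    # Un-reverse the hits so they read normally on display.
    return [term[::-1] for term in prefix_search(rco_index, reversed_prefix)]


rco = build_rco_index(["foobar", "bar", "barfoo", "naphthalene"])
matches = leading_wildcard_search(rco, "*bar")
```

Each term appears exactly once in the reversed list, which matches the remark
below that an RCO index is the same size as its regular counterpart.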
> 
> Rene's special index could be several times larger in entry count, depending
> on the average length of the contained terms. A reverse-character-order
> index is the same size as its regular counterpart.
> 
> Cheers,
> John
> -----Original Message-----
> From: Hackl, Rene [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 12, 2003 6:34 AM
> To: 'Lucene Users List'
> Subject: Re: Can use Lucene be used for this
> 
> 
> >> col2 like %aa%
> 
> > Lucene doesn't handle queries where the start of the term is not known
> > very efficiently.
> 
> Is it really able to handle them at all? I thought "*foo"-type queries were
> not supported.
> 
> That's why I build two indexes for the purpose of simultaneous left and
> right truncation: one "normal" index and a special one that takes tokens
> and breaks them down. For instance, "foobar" would also be indexed as
> "oobar" and "obar". For a query "*oba*", the left wildcard causes the
> special index to be searched for "oba*"; queries that are not left-truncated
> search the normal index.
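
The token-splitting scheme above might look roughly like this in plain
Python. This is a sketch under assumptions, not the actual indexing code: the
function names and the `min_len` parameter are invented here, and it emits
every suffix down to a single character, whereas the message only lists the
first few.

```python
import bisect


def suffixes(token, min_len=1):
    """Return token plus each of its suffixes of at least min_len chars."""
    return [token[i:] for i in range(len(token) - min_len + 1)]


def build_suffix_index(tokens, min_len=1):
    """Map each suffix to the tokens it came from; also return sorted keys."""
    postings = {}
    for token in tokens:
        for suffix in suffixes(token, min_len):
            postings.setdefault(suffix, set()).add(token)
    return sorted(postings), postings


def infix_search(index, query):
    """Answer '*oba*' as the prefix query 'oba*' on the suffix index."""
    keys, postings = index
    infix = query.strip("*")
    lo = bisect.bisect_left(keys, infix)
    found = set()
    for key in keys[lo:]:
        if not key.startswith(infix):
            break  # keys are sorted, so no later key can match
        found |= postings[key]
    return sorted(found)


index = build_suffix_index(
    ["foobar", "1-foo-bar", "rab-oof-13-foonyl-naphthalene", "naphthalene"]
)
foo_hits = infix_search(index, "*foo*")
oba_hits = infix_search(index, "*oba*")
```

A token of n characters contributes up to n entries, which is why the special
index grows so much faster than the normal one.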
> 
> The special index is created with maxFieldLength = 100000
> 
> build-time specialIndex vs. normalIndex: +60%
> index size specialIndex vs. normalIndex: +240%
> index size specialIndex vs. originalDocSize: +60%
> 
> Query execution is still very fast on a 3GB specialIndex. 
> 
> I guess the usability depends on how large your document collection is and
> what kind of search functionality you need. The drawbacks of this approach
> are that proximity and phrase searches on the special index are busted. 
> 
> Would it make sense to prevent creating the prx-file to reduce index size
> when not offering that kind of search anyway? Is it possible at all?
> 
> Best regards,
> René
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

