On Thu, Nov 13, 2003 at 09:22:57AM +0100, Hackl, Rene wrote:
> Hi John,
>
> Indeed, the RCO (reverse-character-order) index is ok for prefix-style
> wildcards. But it doesn't work for _simultaneous_ left and right truncation
> ("*oba*"). I have no idea how often this kind of search is actually
> employed, but in this particular context it is really needed. (I sketched
> this before on this list. In brief: documents contain very long strings for
> chemical substances, and users are interested in certain parts of the
> string, e.g. find all documents that contain "*foo*", be it "1-foo-bar" or
> "rab-oof-13-foonyl-naphthalene".)
If you can figure out how to tell Lucene what the parts of the strings are
when you create the index, it should be easy to do this. Otherwise, I suspect
that Lucene might not be the right tool for the job (more experienced users
might care to confirm).

DDJ did have a recent article that mentioned a solution to a similar problem:

  Full-Text Searching & the Burrows-Wheeler Transform
  Kendall Willets
  Here's an indexing method that lets you find any character sequence in
  the source text using a structure that can fit the entire source text and
  index into less space than the text alone.
  http://www.ddj.com/articles/2003/0312/

Didn't look at it in depth though.

Regards,

Dror

>
> Suggestions on improvements are always welcome! :-)
>
> Best regards,
> René
>
>
> -----Original Message-----
> From: Majerus, John P. [mailto:[EMAIL PROTECTED]
> Sent: Thursday, November 13, 2003 00:41
> To: 'Lucene Users List'
> Subject: RE: Can use Lucene be used for this
>
> Hello,
> This has probably been put forth on the list before, but how about the
> following approach for leftmost wildcard searches, at least for single-term
> searches?
>
> Reverse the character order of all words after they're stemmed and before
> they're added to a special reverse-character-order index. Any time a
> wildcard is found at the beginning of the search term, the special index
> would be engaged. Then a search for "*bar" would be converted to a search
> for "rab*" on the RCO index, and the search would find "raboof", and this
> result would then be unreversed upon display to yield "foobar".
>
> Rene's special index could be several times larger in entry count, depending
> on the average length of the contained terms. A reverse-character-order
> index is the same size as its regular counterpart.
>
> Cheers,
> John
>
> -----Original Message-----
> From: Hackl, Rene [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 12, 2003 6:34 AM
> To: 'Lucene Users List'
> Subject: Re: Can use Lucene be used for this
>
> >> col2 like %aa%
>
> > Lucene doesn't handle queries where the start of the term is not known
> > very efficiently.
>
> Is it really able to handle them at all? I thought "*foo"-type queries were
> not supported.
>
> That's why I build two indexes for the purpose of simultaneous left and
> right truncation: one "normal" index and another special one, which takes
> tokens and breaks them down. For instance, "foobar" would also be indexed as
> "oobar" and "obar". For a query "*oba*", the left wildcard would cause the
> special index to be searched for "oba*"; queries without left truncation
> would search the normal index.
>
> The special index is created with maxFieldLength = 100000.
>
>   build time, specialIndex vs. normalIndex:     +60%
>   index size, specialIndex vs. normalIndex:     +240%
>   index size, specialIndex vs. originalDocSize: +60%
>
> Query execution is still very fast on a 3GB specialIndex.
>
> I guess the usability depends on how large your document collection is and
> what kind of search functionality you need. The drawback of this approach
> is that proximity and phrase searches on the special index are busted.
>
> Would it make sense to prevent creating the prx-file to reduce index size
> when not offering that kind of search anyway? Is it possible at all?
>
> Best regards,
> René
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

-- 
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com
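
The two indexing ideas quoted in this thread (the suffix-expanded "special"
index and the reverse-character-order index) can be illustrated with a small
sketch. This is plain Python and makes no Lucene calls. The names TermIndex,
suffixes, build, search and search_rco are hypothetical, chosen only for
illustration. It also assumes each chemical name is kept as a single term and
that every suffix down to one character is indexed, which the thread does not
state.

```python
from bisect import bisect_left


class TermIndex:
    """Toy term dictionary (hypothetical): term -> set of document ids."""

    def __init__(self):
        self.postings = {}

    def add(self, doc_id, term):
        self.postings.setdefault(term, set()).add(doc_id)

    def lookup(self, pattern):
        """Exact match, or prefix match when the pattern ends with '*'."""
        if not pattern.endswith("*"):
            return set(self.postings.get(pattern, set()))
        prefix = pattern[:-1]
        keys = sorted(self.postings)  # sorted here per call, for simplicity
        hits = set()
        # Terms sharing a prefix are contiguous in sorted order.
        for key in keys[bisect_left(keys, prefix):]:
            if not key.startswith(prefix):
                break
            hits |= self.postings[key]
        return hits


def suffixes(term):
    """Return the term followed by each of its shorter suffixes."""
    return [term[i:] for i in range(len(term))]


def build(docs):
    """docs: {doc_id: [terms]}. Returns (normal, special, rco) indexes."""
    normal, special, rco = TermIndex(), TermIndex(), TermIndex()
    for doc_id, terms in docs.items():
        for term in terms:
            normal.add(doc_id, term)
            rco.add(doc_id, term[::-1])   # reverse-character-order entry
            for s in suffixes(term):      # suffix-expanded entries
                special.add(doc_id, s)
    return normal, special, rco


def search(normal, special, query):
    """A leading '*' routes the rest of the query to the special index."""
    if query.startswith("*"):
        return special.lookup(query[1:])
    return normal.lookup(query)


def search_rco(rco, query):
    """Left-truncated query only: '*bar' becomes 'rab*' on the reversed index."""
    assert query.startswith("*") and not query.endswith("*")
    return rco.lookup(query[1:][::-1] + "*")


docs = {
    1: ["1-foo-bar"],
    2: ["rab-oof-13-foonyl-naphthalene"],
    3: ["foobar"],
}
normal, special, rco = build(docs)
print(search(normal, special, "*foo*"))  # terms containing "foo"
print(search(normal, special, "foo*"))   # terms starting with "foo"
print(search_rco(rco, "*bar"))           # terms ending with "bar"
```

In this sketch the reversed index holds one entry per term, while the
suffix-expanded index holds one entry per character of each term, which is in
line with the size remarks quoted above. Only the suffix-expanded index can
answer a query that is truncated on both sides.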