Re: Partial / starts with searching

Karl Wettin Fri, 13 Feb 2009 01:52:45 -0800

If you attach an NGramTokenFilter to your analyzer at index and query time, you should be able to query for parts of a word.


http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html


The classes are available in the contrib/analyzers module.

You might want to boost edges a bit more than inner parts. Start by trying something like 3-5 grams.


Be aware, this will produce a rather large index.
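
To illustrate what the two kinds of grams look like, here is a minimal plain-Java sketch. It does not call the Lucene classes; the class name NGramSketch and both method names are made up for the illustration, and the order of the grams is not meant to match what the Lucene filters emit.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT the Lucene NGramTokenFilter API.
final class NGramSketch {

    // Every substring of length minGram..maxGram, starting at every position.
    static List<String> nGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                grams.add(term.substring(start, start + len));
            }
        }
        return grams;
    }

    // Only the grams anchored at the start of the term (the "front edge").
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= maxGram && len <= term.length(); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("ambulance", 3, 5));    // [amb, ambu, ambul]
        System.out.println(nGrams("ambulance", 3, 5).size()); // 18
    }
}
```

With 3-5 grams the single term "ambulance" turns into 18 indexed terms, which is where the index growth comes from. A query for "ambu" run through the same function yields amb, mbu and ambu, all of which are among those 18.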


      karl

On 13 Feb 2009, at 10:43, d-fader wrote:

Karl,
As a matter of fact I more or less did. I'm not really into NGrams, but I read some articles about this technique and I eventually ended up at the 'Did you mean: Lucene?' article written by Tom White. To make a long story short, this solved my problem partially. I now have 2 indexes, and I've written code that extracts all terms a user entered, puts them through the suggestion engine and tries to be clever about which suggestion should be used. Stop words are ignored; when the entered term already occurs more than x times in the index it's probably good (and thus a suggestion is not needed); and when there are suggestions available, the suggestion with the most occurrences in the index is presented. After that the original query is built up again, preserving all command codes (like ", ( ), AND, OR, etc.). As said, this system works pretty well, and if there's a suggestion available it's usually quite accurate, so thanks for this.
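
The selection rule described above could look roughly like the sketch below. It is plain Java and makes several assumptions: the class and method names are hypothetical, the document frequencies are passed in as a simple map, and the candidate list stands in for whatever the suggestion engine returns. It is not the actual code from this setup.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; names and data shapes are assumptions for illustration.
final class SuggestionPicker {

    // Returns the term to use in the rebuilt query: either the original
    // term or the most frequent suggestion.
    static String pick(String term,
                       Set<String> stopWords,
                       Map<String, Integer> docFreq,
                       List<String> candidates,
                       int threshold) {
        // Stop words are ignored: never replace them.
        if (stopWords.contains(term)) {
            return term;
        }
        // A term that already occurs more than `threshold` times is probably fine.
        if (docFreq.getOrDefault(term, 0) > threshold) {
            return term;
        }
        // Otherwise take the suggestion that occurs most often in the index.
        String best = term;
        int bestFreq = 0;
        for (String candidate : candidates) {
            int freq = docFreq.getOrDefault(candidate, 0);
            if (freq > bestFreq) {
                best = candidate;
                bestFreq = freq;
            }
        }
        return best;
    }
}
```

If no candidate occurs in the index at all, the sketch falls back to the term the user typed.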
Still, it doesn't solve my problem fully. But I think I now know why Lucene can't search 'truly' partially. To find a document fast, all terms are stored with a list of the documents that contain the term, and when a user searches, Lucene can identify the documents by comparing the terms entered to the terms on that list, right? If so, it's understandable that a true partial search will never work, but then I just don't understand how Google manages to do this :)
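
The term-to-documents structure described here can be modelled with a sorted map. The following is a toy model in plain Java, not Lucene's actual data structures; the class name and its methods are hypothetical. It shows why an exact term is cheap to look up, why a prefix needs a range over the sorted terms, and why an arbitrary inner part needs a pass over every term.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index; an illustration only, not how Lucene stores its terms.
final class InvertedIndexSketch {

    // term -> ids of the documents containing it, with terms kept sorted
    private final TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();

    void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Exact match: a single lookup in the term dictionary.
    Set<Integer> exact(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    // Prefix match: walk the contiguous range of sorted terms sharing the prefix.
    Set<Integer> prefix(String prefix) {
        TreeSet<Integer> docs = new TreeSet<>();
        SortedMap<String, TreeSet<Integer>> range =
                postings.subMap(prefix, prefix + Character.MAX_VALUE);
        for (TreeSet<Integer> ids : range.values()) {
            docs.addAll(ids);
        }
        return docs;
    }

    // Inner-part match: no ordering helps, so every term has to be inspected.
    Set<Integer> infix(String part) {
        TreeSet<Integer> docs = new TreeSet<>();
        for (Map.Entry<String, TreeSet<Integer>> entry : postings.entrySet()) {
            if (entry.getKey().contains(part)) {
                docs.addAll(entry.getValue());
            }
        }
        return docs;
    }
}
```

In this model the cost of a prefix search grows with the number of terms sharing the prefix, and an inner-part search always visits the whole dictionary. Indexing n-grams as terms trades index size for turning both into exact lookups.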
Jori.




Karl Wettin wrote:
Hi again Jori,

did you try N-grams as suggested in the reply on -dev?


    karl

On 13 Feb 2009, at 09:05, d-fader wrote:
Hi,

I actually posted this message to the dev mailing list earlier,
because I thought my 'issue' was a limitation of Lucene's
functionality, but they redirected me to this mailing list, so I hope
one of you guys can help me out :)
Maybe the 'issue' I'm raising has already been discussed thoroughly;
in that case I think I need some redirection to the sources of those
discussions :) Anyway, here's the thing.
As far as I know it's impossible to search for partial words with
Lucene (except with the asterisk method, e.g. with the
StandardAnalyzer -> ambul* to find ambulance). My problem with that
method is that my index consists of quite a few terms. This means that
if a user searches for 'ambuamster' (ambulance amsterdam), there are so
many terms to search that the waiting time is just unacceptable. Now I
started thinking about why it's impossible to search only a 'part' of
a term or even only the 'start' of a term, and the only reason I could
think of was that the index terms are stored tokenized (in that way you
(of course) can't find partial terms, since the index doesn't actually
contain the literal terms, but tokens instead). But Lucene can also
store all terms untokenized, so in that case, in my humble opinion, a
partial search would be possible, since all terms would be stored
'literally'.
Maybe my thinking is wrong. I only have a black box view of Lucene, so
I don't know much about the indexing algorithm and all, but I just want
to know if this could be done, or else why not :) You see, the users of
my index want to know why they can't search for parts of the words they
enter, and I still can't give them a really good answer, except the 'it
would result in too many OR operators in the query' statement :) . I've
tried using a Dutch stemmer (most of the data I'm indexing is Dutch)
but that didn't work out very well. Furthermore, users sometimes search
for a certain 'filename' and mostly they just enter a part of the name
and thus don't find anything.

I hope someone can enlighten me :) Thanks in advance!

Jori

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]