Re: Partial / starts with searching

Karl Wettin Sat, 14 Feb 2009 03:28:19 -0800

You probably only want to use Ngrams for the text fields, leaving the user name field untokenized. As for losing text field words less than 3 characters long: consider letting them through, perhaps by implementing a filter that passes longer words to an Ngram filter while simply returning the shorter input tokens unchanged.
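
A minimal plain-Java sketch of that length gate, assuming a 3-5 gram range. It deliberately does not use Lucene's TokenFilter API; the class and method names (LengthGatedNGrams, ngrams, analyze) are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: plain Java, not a Lucene TokenFilter.
final class LengthGatedNGrams {

    // All n-grams of sizes minGram..maxGram, all grams of one size before the next size.
    static List<String> ngrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= term.length(); start++) {
                grams.add(term.substring(start, start + n));
            }
        }
        return grams;
    }

    // Tokens shorter than minGram pass through unchanged; longer ones are n-grammed.
    static List<String> analyze(List<String> tokens, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() < minGram) {
                out.add(token);
            } else {
                out.addAll(ngrams(token, minGram, maxGram));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "of" is kept as-is, "lucene" is split into 3-, 4- and 5-grams.
        System.out.println(analyze(Arrays.asList("of", "lucene"), 3, 5));
    }
}
```

With this gate a two-letter word stays searchable as a whole term instead of silently disappearing from the index.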


     karl


13 feb 2009 kl. 14.39 skrev d-fader:

Well, it worked. I indexed a test database and it did indeed grow somewhat (from 16 MiB to 200 MiB :)), and it works flawlessly. Still, I can't use the result in my application :)
The 'live' index database contains about 2 million documents and is used by a multi-user application. As you can probably imagine, not everyone may see everything: there are documents that can be seen by everyone, documents that can be seen by some, and documents that can be seen by only one person. At design time, since we used the StandardAnalyzer, we decided to create a field in each document in which we store the 'login name' of each user that may see the document (2 to 4 characters per user, in most cases 2), and that's where the hiccup occurs.
When I index it with the NGramTokenFilter (3-5) it doesn't seem to index anything with 2 letters. I checked in Luke too: if I search for UserInitials:(JS BD), Luke's query explanation is empty. When I search for UserInitials:(ABC) it seems to do the job well, but when I search for DEFG, the query explanation looks like UserAccessInitials:"def efg defg" and that is unacceptable, since there can be both a user DEFG and a user EFG in the system.
So I think in my case it just won't work, unless I rewrite the 'who may see this document' code pretty drastically, if that is even possible without losing too much search speed.
...or am I wrong?
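
The two symptoms above can be reproduced without Lucene. This is a plain-Java sketch, assuming 3-5 grams and lowercased input; UserNameGrams and its methods are hypothetical names, not Lucene classes. It shows that a 2-letter login name produces no grams at all, and that the single gram of EFG is also one of the grams of DEFG, so a gram lookup for one user can hit documents indexed for the other.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: plain Java, not Lucene's NGramTokenFilter.
final class UserNameGrams {

    // Lowercased n-grams of sizes minGram..maxGram, in order of first appearance.
    static Set<String> grams(String name, int minGram, int maxGram) {
        String lower = name.toLowerCase(Locale.ROOT);
        Set<String> out = new LinkedHashSet<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= lower.length(); start++) {
                out.add(lower.substring(start, start + n));
            }
        }
        return out;
    }

    // True when the two names share at least one gram.
    static boolean gramsOverlap(String a, String b, int minGram, int maxGram) {
        Set<String> shared = new LinkedHashSet<>(grams(a, minGram, maxGram));
        shared.retainAll(grams(b, minGram, maxGram));
        return !shared.isEmpty();
    }

    // What an untokenized field effectively does: compare the whole value.
    static boolean exactMatch(String indexedName, String queryName) {
        return indexedName.equalsIgnoreCase(queryName);
    }

    public static void main(String[] args) {
        System.out.println(grams("JS", 3, 5));                  // no grams for a 2-letter name
        System.out.println(grams("DEFG", 3, 5));                // includes "efg"
        System.out.println(gramsOverlap("DEFG", "EFG", 3, 5));  // the names collide on a gram
        System.out.println(exactMatch("DEFG", "EFG"));          // whole-value comparison does not
    }
}
```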

Karl Wettin wrote:
If you attach an NGramTokenFilter to your analyzer at index and query time you should be able to query for parts of the word.
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html

The classes are available in the contrib/analyzers module.
You might want to boost edges a bit more than inner parts. Start by trying something like 3-5 grams.
Be aware, this will produce a rather large index.
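
For the "starts with" case, edge grams are the relevant part. A plain-Java sketch of front-edge grams, assuming the same 3-5 range; EdgeGrams and edgeNGrams are illustrative names, not the EdgeNGramTokenFilter API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, not Lucene's EdgeNGramTokenFilter.
final class EdgeGrams {

    // Prefixes of the term with lengths minGram..maxGram (capped at the term length).
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int n = minGram; n <= longest; n++) {
            grams.add(term.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Each prefix becomes its own indexed term, so a query for "ambu"
        // can be answered by looking up that one term.
        System.out.println(edgeNGrams("ambulance", 3, 5));
    }
}
```

Edge grams add at most one term per gram size for each word, whereas inner grams add one per position and size, which is where most of the index growth comes from.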


     karl

13 feb 2009 kl. 10.43 skrev d-fader:
Karl,
As a matter of fact I more or less did. I'm not really into NGrams, but I read some articles about this technique and I eventually ended up at the 'Did you mean: Lucene?' article written by Tom White. To make a long story short, this solved my problem partially. I now have 2 indexes and I've written code that extracts all terms a user entered, puts them through the suggestion engine and tries to be clever about which suggestion should be used. Stop words are ignored; when the entered term already occurs more than x times in the index it's probably good (and thus a suggestion is not needed); and when there are suggestions available, the suggestion with the most occurrences in the index is presented. After that the original query is built up again, preserving all command codes (like ", ( ), AND, OR, etc.). As said, this system works pretty well, and when there's a suggestion available it's usually quite accurate, so thanks for this.
Still, it doesn't solve my problem fully. But I think I now know why Lucene can't search 'truly' partially. To find a document fast, all terms are stored with a list of the documents that contain the term, and when a user searches, Lucene can identify the documents by comparing the terms entered to the terms on that list, right? If so, it's understandable that a true partial search will never work, but then I just don't understand how Google manages to do this :)
Jori.
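
The term-to-documents structure described above can be sketched in a few lines of plain Java. This is only a toy model under the assumption of whitespace tokenization and lowercasing, not Lucene's actual index format; InvertedIndexSketch and its methods are hypothetical names. It shows why a whole-term lookup is cheap while a "contains" search has to visit every term.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of an inverted index: term -> ids of documents containing it.
final class InvertedIndexSketch {

    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Split on whitespace, lowercase, and record the document id under each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Whole-term search: a single dictionary lookup.
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Integer>emptySortedSet());
    }

    // Partial search without n-grams: every term in the dictionary must be inspected.
    SortedSet<Integer> containing(String part) {
        String needle = part.toLowerCase(Locale.ROOT);
        SortedSet<Integer> hits = new TreeSet<>();
        for (Map.Entry<String, SortedSet<Integer>> entry : postings.entrySet()) {
            if (entry.getKey().contains(needle)) {
                hits.addAll(entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "ambulance amsterdam");
        index.add(2, "amsterdam harbour");
        System.out.println(index.lookup("amsterdam"));
        System.out.println(index.lookup("ambu"));
        System.out.println(index.containing("ambu"));
    }
}
```

Indexing n-grams moves the cost of the second method to index time: each part of a word becomes a term of its own, so a partial query turns back into plain lookups.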




Karl Wettin wrote:
Hi again Jori,

did you try N-grams as suggested in the reply on -dev?


   karl

13 feb 2009 kl. 09.05 skrev d-fader:
Hi,

I actually posted this message on the dev mailing list earlier, because I thought my 'issue' was a limitation of Lucene's functionality, but they redirected me to this mailing list, so I hope one of you guys can help me out :)
Maybe the 'issue' I'm raising has already been discussed thoroughly; in that case I think I need a pointer to those discussions :) Anyway, here's the thing.
As far as I know it's impossible to search for partial words with Lucene (except with the asterisk method, e.g. with the StandardAnalyzer -> ambul* to find ambulance). My problem with that method is that my index consists of quite a few terms. This means that if a user searches for 'ambu amster' (ambulance amsterdam), there are so many terms to search that the waiting time is just unacceptable. So I started thinking about why it's impossible to search for only a 'part' of a term, or even only the 'start' of a term, and the only reason I could think of was that the index terms are stored tokenized (in that case you of course can't find partial terms, since the index doesn't actually contain the literal terms, but tokens instead). But Lucene can also store all terms untokenized, so in that case, in my humble opinion, a partial search would be possible, since all terms would be stored 'literally'.
Maybe my thinking is wrong. I only have a black box view of Lucene, so I don't know much about the indexing algorithm and all, but I just want to know if this could be done, or else why not :) You see, the users of my index want to know why they can't search for parts of the words they enter and I still can't give them a really good answer, except the 'it would result in too many OR operators in the query' statement :) I've tried using a Dutch stemmer (most of the data I'm indexing is Dutch) but that didn't work out very well. Furthermore, users sometimes search for a certain 'filename' and mostly they just enter a part of the name and thus don't find anything.
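
To make the 'too many OR operators' point concrete, here is a plain-Java sketch of how a trailing-wildcard query such as ambul* relates to a sorted term dictionary. It is a simplified model, not Lucene's PrefixQuery code; PrefixExpansion and expand are hypothetical names. The assumption is that every term starting with the prefix ends up as one alternative in the query, so the cost grows with the number of matching terms.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Simplified model of expanding "prefix*" against a sorted term dictionary.
final class PrefixExpansion {

    // All terms in [prefix, prefix + '\uffff'), i.e. the terms that start with the prefix.
    // Assumes no indexed term contains the character '\uffff' right after the prefix.
    static SortedSet<String> expand(NavigableSet<String> termDictionary, String prefix) {
        return termDictionary.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                Arrays.asList("ambulance", "ambulant", "amstel", "amsterdam", "bus"));
        // Each returned term would become one alternative in the rewritten query.
        System.out.println(expand(terms, "ambul"));
        System.out.println(expand(terms, "am").size());
    }
}
```

A short prefix over a large dictionary expands into a very long list of alternatives, which is the slowdown described above.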

I hope someone can enlighten me :) Thanks in advance!

Jori

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
