Re: Partial / starts with searching

d-fader Fri, 13 Feb 2009 05:55:13 -0800

Thanks, I will. It might take a while, but I'll post my findings.


Erick Erickson wrote:

*really* think about using Filters here for the user permissions.
I think you'd be surprised how quickly you could construct one,
and depending upon how many users you have you might be able
to get some benefit from CachingWrapperFilter.
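
A rough sketch of the idea in plain Java, not using Lucene's actual Filter API: build one bit set per user marking the documents that user may see, cache it, and intersect it with the search hits. The class name, the `builds` counter and the rule "an empty user list means visible to everyone" are assumptions made up for this illustration. In Lucene the bit set would come from a Filter, and CachingWrapperFilter would roughly play the role of the map.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene's Filter API.
final class PermissionFilterSketch {
    // Index = document id; value = login names allowed to see that document.
    // Assumption: an empty set means "visible to everyone".
    private final List<Set<String>> allowedUsersPerDoc;
    private final Map<String, BitSet> cache = new HashMap<>();
    int builds = 0; // counts how often a bit set is actually constructed

    PermissionFilterSketch(List<Set<String>> allowedUsersPerDoc) {
        this.allowedUsersPerDoc = allowedUsersPerDoc;
    }

    // Bit set of documents visible to the user, built once and then cached.
    BitSet visibleTo(String user) {
        return cache.computeIfAbsent(user, u -> {
            builds++;
            BitSet bits = new BitSet(allowedUsersPerDoc.size());
            for (int doc = 0; doc < allowedUsersPerDoc.size(); doc++) {
                Set<String> allowed = allowedUsersPerDoc.get(doc);
                if (allowed.isEmpty() || allowed.contains(u)) {
                    bits.set(doc);
                }
            }
            return bits;
        });
    }

    // Keeps only the hits the user is allowed to see; the input is not modified.
    BitSet filter(BitSet hits, String user) {
        BitSet result = (BitSet) hits.clone();
        result.and(visibleTo(user));
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> acl = List.of(
                Set.<String>of(), Set.of("JS", "BD"), Set.of("DEFG"), Set.of("EFG"));
        PermissionFilterSketch filter = new PermissionFilterSketch(acl);
        BitSet hits = new BitSet();
        hits.set(0, 4); // pretend the query matched documents 0..3
        System.out.println(filter.filter(hits, "EFG"));  // {0, 3}
        System.out.println(filter.filter(hits, "DEFG")); // {0, 2}
    }
}
```

Because the login names are compared as whole strings here, users DEFG and EFG cannot be confused, whatever analyzer is used for the searchable text fields.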

And ignore my entire diatribe about whether you can restrict
your wildcards to 3 or more characters, that approach doesn't
fit your problem space at all....

Best
Erick

On Fri, Feb 13, 2009 at 8:39 AM, d-fader <dfa...@gmail.com> wrote:

Well, it worked. I indexed a test database and it indeed grew somewhat
(from 16 MiB to 200 MiB :)), and it works flawlessly. Still, I can't use the
result in my application :)
The 'live' index database contains about 2 million documents and is used by
a multi-user application. As you can probably imagine, not everyone may see
everything: there are documents that can be seen by everyone, documents that
can be seen by some and also documents that can only be seen by one person.
At design time, since we used the StandardAnalyzer, we decided to create a
field in each document in which we store the 'login name' of each user that
may see the document (2 to 4 characters per user, in most cases 2) and
that's where the hiccup occurs. When I index it with the NGramTokenFilter
(3-5), it doesn't seem to index anything with 2 letters. I checked in Luke
too: if I search for UserInitials:(JS BD), Luke's query explanation is
empty. When I search for UserInitials:(ABC) it seems to do the job well, but
when I search for DEFG, the query explanation looks like
UserAccessInitials:"def efg defg", and that is unacceptable, since there can
be both a user DEFG and a user EFG in the system.
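
The behaviour described above can be reproduced with a few lines of plain Java. This is only a sketch of what a 3-5 character n-gram split does, not the NGramTokenFilter source; the class and method names are made up, and the ordering (all 3-grams first, then 4-grams) is an assumption chosen to match the "def efg defg" seen in Luke.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a 3-5 character n-gram split.
final class NGramSketch {
    // All substrings of `term` with length minGram..maxGram, shortest size first.
    static List<String> ngrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= term.length(); start++) {
                grams.add(term.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("js", 3, 5));   // [] : too short, nothing is indexed
        System.out.println(ngrams("abc", 3, 5));  // [abc]
        System.out.println(ngrams("defg", 3, 5)); // [def, efg, defg]
        System.out.println(ngrams("efg", 3, 5));  // [efg] : also produced by "defg"
    }
}
```

A 2-letter login name yields no grams at all with a minimum size of 3, and the grams of DEFG include the only gram of EFG, which is exactly the collision between the two users.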

So I think in my case it just won't work, unless I rewrite the 'who may see
this document' code pretty drastically, if even possible without losing too
much 'searching' speed.

...or am I wrong?


Karl Wettin wrote:

If you attach an NGramTokenFilter to your analyzer at index and query time
you should be able to query for parts of the word.


http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html

http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html

The classes are available in the contrib/analyzers module.

You might want to boost edges a bit more than inner parts. Start by trying
something like 3-5 grams.

Be aware, this will produce a rather large index.
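
To illustrate the "edges" mentioned above, here is a minimal plain-Java sketch of front-edge n-grams. It is not the EdgeNGramTokenFilter implementation; the class and method names are hypothetical. Each gram is a prefix of the term, so a "starts with" search becomes an exact lookup of one indexed term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of front-edge n-grams (prefixes only).
final class EdgeNGramSketch {
    // Prefixes of `term` with length minGram..maxGram, capped at the term length.
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int n = minGram; n <= longest; n++) {
            grams.add(term.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("ambulance", 3, 5)); // [amb, ambu, ambul]
        System.out.println(edgeNGrams("amsterdam", 3, 5)); // [ams, amst, amste]
    }
}
```

Every indexed word contributes several extra terms, which is why the index grows noticeably.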


     karl

13 feb 2009 kl. 10.43 skrev d-fader:

 Karl,

As a matter of fact I more or less did. I'm not really into NGrams, but I
read some articles about this technique and I eventually ended up at the
'Did you mean: Lucene?' article written by Tom White. To make a long story
short, this solved my problem partially. I do have 2 indexes now and I've
written code to extract all terms a user entered, put them through the
suggestion engine and tries to be clever about what suggestion should be
used. It includes that stop words are ignored, when the entered term exists
for more than x times in the index already it's probably good (and thus a
suggestion is not needed) and when there are suggestions available, the
suggestion with the most occurences in the index is presented. After that
the original query is being built up again, preserving all command codes
(like ", ( ), AND, OR, etc. etc.).
As said, this system works pretty well and mostly if there's a suggestion
available, it's actually quite accurate, so thanks for this.
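
The selection rules described above could look roughly like this in plain Java. This is a guess at the logic, not the actual code from the application: the class name, the method signature and the idea of passing term frequencies in as a map are all assumptions, and the threshold `minFreq` stands in for the unspecified "x".

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the suggestion-selection rules.
final class SuggestionSketch {
    // Returns the term to use in the rebuilt query: the original term or a suggestion.
    static String choose(String term, List<String> suggestions,
                         Map<String, Integer> docFreq, Set<String> stopWords, int minFreq) {
        if (stopWords.contains(term)) {
            return term; // stop words are ignored
        }
        if (docFreq.getOrDefault(term, 0) > minFreq) {
            return term; // occurs often enough already, no suggestion needed
        }
        String best = term; // fall back to the original if there are no suggestions
        int bestFreq = -1;
        for (String candidate : suggestions) {
            int freq = docFreq.getOrDefault(candidate, 0);
            if (freq > bestFreq) { // keep the suggestion with the most occurrences
                best = candidate;
                bestFreq = freq;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = Map.of("ambulance", 120, "ambulant", 15, "amsterdam", 300);
        Set<String> stopWords = Set.of("de", "het", "een");
        System.out.println(choose("ambulanse", List.of("ambulant", "ambulance"),
                docFreq, stopWords, 5)); // ambulance
        System.out.println(choose("amsterdam", List.of("amstelveen"),
                docFreq, stopWords, 5)); // amsterdam
    }
}
```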

Still, it doesn't solve my problem fully. But I think I now know why
Lucene can't search 'truly' partially. To find a document fast, all terms
are stored with a list of documents which contain the term and when a user
searches, Lucene can identify the documents by comparing the terms entered
to the terms on that list, right? If so, it's understandable that a true
partial search never will work, but then I just don't understand how Google
manages to do this :)
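
The term-to-documents list described above is an inverted index. A toy version in plain Java, which is an illustration and not how Lucene stores its index, shows why the three kinds of lookup differ in cost. The class and method names are made up for this sketch.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy inverted index: sorted terms, each with the ids of documents containing it.
final class InvertedIndexSketch {
    private final TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();

    void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Exact term: a single lookup in the sorted term list.
    Set<Integer> exact(String term) {
        return new TreeSet<>(postings.getOrDefault(term, new TreeSet<>()));
    }

    // "Starts with": terms sharing a prefix are adjacent in sorted order,
    // but every matching term's document list must still be merged.
    Set<Integer> prefix(String prefix) {
        Set<Integer> docs = new TreeSet<>();
        for (TreeSet<Integer> p : postings.subMap(prefix, prefix + Character.MAX_VALUE).values()) {
            docs.addAll(p);
        }
        return docs;
    }

    // "Contains": sorted order does not help, so every term has to be inspected.
    Set<Integer> infix(String part) {
        Set<Integer> docs = new TreeSet<>();
        for (Map.Entry<String, TreeSet<Integer>> e : postings.entrySet()) {
            if (e.getKey().contains(part)) {
                docs.addAll(e.getValue());
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "ambulance", "amsterdam");
        index.add(2, "ambulant");
        index.add(3, "rotterdam");
        System.out.println(index.exact("ambu"));  // []
        System.out.println(index.prefix("ambu")); // [1, 2]
        System.out.println(index.infix("dam"));   // [1, 3]
    }
}
```

Indexing n-grams moves the work to index time: each part of a word becomes a term of its own, so a partial search turns back into an exact lookup.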

Jori.




Karl Wettin wrote:

Hi again Jori,

did you try N-grams as suggested in the reply on -dev?


   karl

13 feb 2009 kl. 09.05 skrev d-fader:

 Hi,

I've actually posted this message in the dev mailing list earlier,
because I thought my 'issue' was a limitation of the functionality of
Lucene, but they redirected me to this mailing list, so I hope one of
you guys can help me out :)

Maybe the 'issue' I'm addressing now has been discussed thoroughly already;
in that case I think I need some redirection to the sources of those
discussions :) Anyway, here's the thing.
For all I know it's impossible to search partial words with Lucene
(except the asterisk method with e.g. the StandardAnalyzer -> ambul* to
find ambulance). My problem with that method is that my index consists
of quite a few terms. This means that if a user would search for 'ambu
amster' (ambulance amsterdam), there will be so many terms to search that
the waiting time is just unacceptable. Now I started thinking about why
it's impossible to search only a 'part' of a term or even only the 'start'
of a term, and the only reason I could think of was that the index terms
are stored tokenized (in that way you (of course) can't find partial terms,
since the index doesn't actually contain the literal terms, but tokens
instead). But Lucene can also store all terms untokenized, so in that
case, in my humble opinion, a partial search would be possible, since
all terms would be stored 'literally'.

Maybe my thinking is wrong. I only have a black box view of Lucene, so
I don't know much about the indexing algorithm and all, but I just want to
know if this could be done or else why not :) You see, the users of my
index want to know why they can't search parts of the words they enter
and I still can't give them a really good answer, except the 'it would
result in too many OR operators in the query' statement :) . I've tried
using a Dutch stemmer (most of the data I'm indexing is Dutch) but that
didn't work out very well. Furthermore, users sometimes search for a
certain 'filename' and mostly they just enter a part of the name and
thus don't find anything.

I hope someone can enlighten me :) Thanks in advance!

Jori

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
