You probably only want to use Ngrams for the text fields, leaving the
user name field untokenized. As for loosing text field words less than
3 characters long: consider letting them through, perhaps by
implementing a filter that pass longer word to an Ngram filter while
you just return the shorter input tokens.
karl
13 feb 2009 kl. 14.39 skrev d-fader:
Well, it worked. I indexed a test database and it indeed grew
somewhat (from 16 MiB to 200 MiB :)), and it works flawlessly.
Still, I can't use the result in my application :)
The 'live' index database contains about 2 million documents and is
used by a multi-user application. As you probably can imagine, not
everyone may see everything, there are documents that can be seen by
everyone, documents that can be seen by some and also documents that
only can be seen by one person. At design time, since we used the
StandardAnalyzer, we decided to create a field in each document in
which we store the 'login name' of each user that may see the
document (2 to 4 characters per user, in most cases 2) and that's
where the hick-up occurs. When I index it with the NGramTokenFilter
(3-5) it doesn't seem to index anything with 2 letters. I checked in
Luke too, if I search for UserInitials:(JS BD), Luke's query
explanation is empty. When I search for UserInitials:(ABC) it seems
to do the job well but I when I search for DEFG, the query
explanation looks like UserAccessInitials:"def efg defg" and that is
inacceptable, since there can be a user DEFG and a user EFG
available in the system.
So I think in my case it just won't work, unless I rewrite the 'who
may see this document' code pretty drastically, if even possible
without losing too much 'searching' speed.
...or am I wrong?
Karl Wettin wrote:
If you attach an NgramTokenFilter to your analyzer at index and
query time you should be able to query for parts of the word.
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html
The classes are available in the contrib/analyzer module.
You might want to boost edges a bit more than inner parts, start
trying out with something like 3-5 grams.
Be aware, this will produce a rather large index.
karl
13 feb 2009 kl. 10.43 skrev d-fader:
Karl,
As a matter of fact I more or less did. I'm not really into
NGrams, but I read some articles about this technique and I
eventually ended up at the 'Did you mean: Lucene?' article written
by Tom White. To make a long story short, this solved my problem
partially. I do have 2 indexes now and I've written code to
extract all terms a user entered, put them through the suggestion
engine and tries to be clever about what suggestion should be
used. It includes that stop words are ignored, when the entered
term exists for more than x times in the index already it's
probably good (and thus a suggestion is not needed) and when there
are suggestions available, the suggestion with the most occurences
in the index is presented. After that the original query is being
built up again, preserving all command codes (like ", ( ), AND,
OR, etc. etc.).
As said, this system works pretty well and mostly if there's a
suggestion available, it's actually quite accurate, so thanks for
this.
Still, it doesn't solve my problem fully. But I think I now know
why Lucene can't search 'truely' partially. To find a document
fast, all terms are stored with a list of documents which contain
the term and when a user searches, Lucene can identify the
documents by comparing the terms entered to the terms on that
list, right? If so, it's understandable that a true partial search
never will work, but then I just don't understand how Google
manages to do this :)
Jori.
Karl Wettin wrote:
Hi again Jori,
did you try N-grams as suggested in the reply on -dev?
karl
13 feb 2009 kl. 09.05 skrev d-fader:
Hi,
I've actually posted this message in de dev mailing list earlier,
because I though my 'issue' is a limitation of the functionality
of
Lucene, but they redirected me to this mailinglist, so I hope
one of you
guys can help me out :)
Maybe the 'issue' I'm addressing now is discussed thouroughly
already,
in that case I think I need some redirection to the sources of
those
discussions :) Anyway, here's the thing.
For all I know it's impossible to search partial words with Lucene
(except the asterix method with e.g. the StandardAnalyzer ->
ambul* to
find ambulance). My problem with that method is that my index
consists
of quite a few terms. This means that if a user would search for
'ambu
amster' (ambulance amsterdam), there will be so many terms to
search,
the waiting time is just inacceptable. Now I started thinking
why it's
impossible to search only a 'part' of a term or even only the
'start' of
a term and the only reason I could think of was that the Index
terms are
stored tokenized (in that way you (of course) can't find partial
terms,
since the index doesn't actually contain the literal terms, but
tokens
instead). But Lucene can also store all terms untokenized, so in
that
case, in my humble opinion, a partial search would be possible,
since
all terms would be stored 'literally'.
Maybe my thinking is wrong, I only have a black box view of
Lucene, so I
don't know much about indexing algorithm and all, but I just
want to
know if this could be done or else why not :) You see, the users
of my
index want to know why they can't search parts of the words they
enter
and I still can't give them a really good answer, except the 'it
would
result in too many OR operators in the query' statement :) .
I've tried
using a Dutch stemmer (most of the data I'm indexing is Dutch)
but that
didn't work out quite good. Furthermore users sometimes search
for a
certain 'filename' and mostly they just enter a part of the name
and
thus don't find anything.
I hope someone can enlighten me :) Thanks in advance!
Jori
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org