Re: Partial / starts with searching

Robert Muir Fri, 13 Feb 2009 12:14:19 -0800

have you looked at perfieldanalyzerwrapper?

This analyzer is used to facilitate scenarios where different fields require
different analysis techniques... i'm using it for a few things and haven't
had any issues.


agree filters might be better but if you want a quick fix...

On Fri, Feb 13, 2009 at 8:52 AM, d-fader <[email protected]> wrote:

> Thanks, I will. It might take a while, but I'll post my findings.
>
>
> Erick Erickson wrote:
>
>> *really* think about using Filters here for the user permissions.
>> I think you'd be surprised how quickly you could construct one,
>> and depending upon how many users you have you might be able
>> to get some benefit from CachingWrapperFilter.
>>
>> And ignore my entire diatribe about whether you can restrict
>> your wildcards to 3 or more characters, that approach doesn't
>> fit your problem space at all....
>>
>> Best
>> Erick
>>
>> On Fri, Feb 13, 2009 at 8:39 AM, d-fader <[email protected]> wrote:
>>
>>
>>
>>> Well, it worked. I indexed a test database and it indeed grew somewhat
>>> (from 16 MiB to 200 MiB :)), and it works flawlessly. Still, I can't use
>>> the
>>> result in my application :)
>>> The 'live' index database contains about 2 million documents and is used
>>> by
>>> a multi-user application. As you probably can imagine, not everyone may
>>> see
>>> everything, there are documents that can be seen by everyone, documents
>>> that
>>> can be seen by some and also documents that only can be seen by one
>>> person.
>>> At design time, since we used the StandardAnalyzer, we decided to create
>>> a
>>> field in each document in which we store the 'login name' of each user
>>> that
>>> may see the document (2 to 4 characters per user, in most cases 2) and
>>> that's where the hick-up occurs. When I index it with the
>>> NGramTokenFilter
>>> (3-5) it doesn't seem to index anything with 2 letters. I checked in Luke
>>> too, if I search for UserInitials:(JS BD), Luke's query explanation is
>>> empty. When I search for UserInitials:(ABC) it seems to do the job well
>>> but
>>> I when I search for DEFG, the query explanation looks like
>>> UserAccessInitials:"def efg defg" and that is inacceptable, since there
>>> can
>>> be a user DEFG and a user EFG available in the system.
>>>
>>> So I think in my case it just won't work, unless I rewrite the 'who may
>>> see
>>> this document' code pretty drastically, if even possible without losing
>>> too
>>> much 'searching' speed.
>>>
>>> ...or am I wrong?
>>>
>>>
>>> Karl Wettin wrote:
>>>
>>>
>>>
>>>> If you attach an NgramTokenFilter to your analyzer at index and query
>>>> time
>>>> you should be able to query for parts of the word.
>>>>
>>>>
>>>>
>>>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
>>>>
>>>>
>>>> http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html
>>>>
>>>> The classes are available in the contrib/analyzer module.
>>>>
>>>> You might want to boost edges a bit more than inner parts, start trying
>>>> out with something like 3-5 grams.
>>>>
>>>> Be aware, this will produce a rather large index.
>>>>
>>>>
>>>>     karl
>>>>
>>>> 13 feb 2009 kl. 10.43 skrev d-fader:
>>>>
>>>>  Karl,
>>>>
>>>>
>>>>> As a matter of fact I more or less did. I'm not really into NGrams, but
>>>>> I
>>>>> read some articles about this technique and I eventually ended up at
>>>>> the
>>>>> 'Did you mean: Lucene?' article written by Tom White. To make a long
>>>>> story
>>>>> short, this solved my problem partially. I do have 2 indexes now and
>>>>> I've
>>>>> written code to extract all terms a user entered, put them through the
>>>>> suggestion engine and tries to be clever about what suggestion should
>>>>> be
>>>>> used. It includes that stop words are ignored, when the entered term
>>>>> exists
>>>>> for more than x times in the index already it's probably good (and thus
>>>>> a
>>>>> suggestion is not needed) and when there are suggestions available, the
>>>>> suggestion with the most occurences in the index is presented. After
>>>>> that
>>>>> the original query is being built up again, preserving all command
>>>>> codes
>>>>> (like ", ( ), AND, OR, etc. etc.).
>>>>> As said, this system works pretty well and mostly if there's a
>>>>> suggestion
>>>>> available, it's actually quite accurate, so thanks for this.
>>>>>
>>>>> Still, it doesn't solve my problem fully. But I think I now know why
>>>>> Lucene can't search 'truely' partially. To find a document fast, all
>>>>> terms
>>>>> are stored with a list of documents which contain the term and when a
>>>>> user
>>>>> searches, Lucene can identify the documents by comparing the terms
>>>>> entered
>>>>> to the terms on that list, right? If so, it's understandable that a
>>>>> true
>>>>> partial search never will work, but then I just don't understand how
>>>>> Google
>>>>> manages to do this :)
>>>>>
>>>>> Jori.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Karl Wettin wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi again Jori,
>>>>>>
>>>>>> did you try N-grams as suggested in the reply on -dev?
>>>>>>
>>>>>>
>>>>>>   karl
>>>>>>
>>>>>> 13 feb 2009 kl. 09.05 skrev d-fader:
>>>>>>
>>>>>>  Hi,
>>>>>>
>>>>>>
>>>>>>> I've actually posted this message in de dev mailing list earlier,
>>>>>>> because I though my 'issue' is a limitation of the functionality of
>>>>>>> Lucene, but they redirected me to this mailinglist, so I hope one of
>>>>>>> you
>>>>>>> guys can help me out :)
>>>>>>>
>>>>>>> Maybe the 'issue' I'm addressing now is discussed thouroughly
>>>>>>> already,
>>>>>>> in that case I think I need some redirection to the sources of those
>>>>>>> discussions :) Anyway, here's the thing.
>>>>>>> For all I know it's impossible to search partial words with Lucene
>>>>>>> (except the asterix method with e.g. the StandardAnalyzer -> ambul*
>>>>>>> to
>>>>>>> find ambulance). My problem with that method is that my index
>>>>>>> consists
>>>>>>> of quite a few terms. This means that if a user would search for
>>>>>>> 'ambu
>>>>>>> amster' (ambulance amsterdam), there will be so many terms to search,
>>>>>>> the waiting time is just inacceptable. Now I started thinking why
>>>>>>> it's
>>>>>>> impossible to search only a 'part' of a term or even only the 'start'
>>>>>>> of
>>>>>>> a term and the only reason I could think of was that the Index terms
>>>>>>> are
>>>>>>> stored tokenized (in that way you (of course) can't find partial
>>>>>>> terms,
>>>>>>> since the index doesn't actually contain the literal terms, but
>>>>>>> tokens
>>>>>>> instead). But Lucene can also store all terms untokenized, so in that
>>>>>>> case, in my humble opinion, a partial search would be possible, since
>>>>>>> all terms would be stored 'literally'.
>>>>>>>
>>>>>>> Maybe my thinking is wrong, I only have a black box view of Lucene,
>>>>>>> so
>>>>>>> I
>>>>>>> don't know much about indexing algorithm and all, but I just want to
>>>>>>> know if this could be done or else why not :) You see, the users of
>>>>>>> my
>>>>>>> index want to know why they can't search parts of the words they
>>>>>>> enter
>>>>>>> and I still can't give them a really good answer, except the 'it
>>>>>>> would
>>>>>>> result in too many OR operators in the query' statement :) . I've
>>>>>>> tried
>>>>>>> using a Dutch stemmer (most of the data I'm indexing is Dutch) but
>>>>>>> that
>>>>>>> didn't work out quite good. Furthermore users sometimes search for a
>>>>>>> certain 'filename' and mostly they just enter a part of the name and
>>>>>>> thus don't find anything.
>>>>>>>
>>>>>>> I hope someone can enlighten me :) Thanks in advance!
>>>>>>>
>>>>>>> Jori
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>>
>>>>>
>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>>>>
>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


-- 
Robert Muir
[email protected]

Re: Partial / starts with searching

Reply via email to