have you looked at perfieldanalyzerwrapper? This analyzer is used to facilitate scenarios where different fields require different analysis techniques... i'm using it for a few things and haven't had any issues.
agree filters might be better but if you want a quick fix... On Fri, Feb 13, 2009 at 8:52 AM, d-fader <dfa...@gmail.com> wrote: > Thanks, I will. It might take a while, but I'll post my findings. > > > Erick Erickson wrote: > >> *really* think about using Filters here for the user permissions. >> I think you'd be surprised how quickly you could construct one, >> and depending upon how many users you have you might be able >> to get some benefit from CachingWrapperFilter. >> >> And ignore my entire diatribe about whether you can restrict >> your wildcards to 3 or more characters, that approach doesn't >> fit your problem space at all.... >> >> Best >> Erick >> >> On Fri, Feb 13, 2009 at 8:39 AM, d-fader <dfa...@gmail.com> wrote: >> >> >> >>> Well, it worked. I indexed a test database and it indeed grew somewhat >>> (from 16 MiB to 200 MiB :)), and it works flawlessly. Still, I can't use >>> the >>> result in my application :) >>> The 'live' index database contains about 2 million documents and is used >>> by >>> a multi-user application. As you probably can imagine, not everyone may >>> see >>> everything, there are documents that can be seen by everyone, documents >>> that >>> can be seen by some and also documents that only can be seen by one >>> person. >>> At design time, since we used the StandardAnalyzer, we decided to create >>> a >>> field in each document in which we store the 'login name' of each user >>> that >>> may see the document (2 to 4 characters per user, in most cases 2) and >>> that's where the hick-up occurs. When I index it with the >>> NGramTokenFilter >>> (3-5) it doesn't seem to index anything with 2 letters. I checked in Luke >>> too, if I search for UserInitials:(JS BD), Luke's query explanation is >>> empty. When I search for UserInitials:(ABC) it seems to do the job well >>> but >>> I when I search for DEFG, the query explanation looks like >>> UserAccessInitials:"def efg defg" and that is inacceptable, since there >>> can >>> be a user DEFG and a user EFG available in the system. >>> >>> So I think in my case it just won't work, unless I rewrite the 'who may >>> see >>> this document' code pretty drastically, if even possible without losing >>> too >>> much 'searching' speed. >>> >>> ...or am I wrong? >>> >>> >>> Karl Wettin wrote: >>> >>> >>> >>>> If you attach an NgramTokenFilter to your analyzer at index and query >>>> time >>>> you should be able to query for parts of the word. >>>> >>>> >>>> >>>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/NGramTokenFilter.html >>>> >>>> >>>> http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html >>>> >>>> The classes are available in the contrib/analyzer module. >>>> >>>> You might want to boost edges a bit more than inner parts, start trying >>>> out with something like 3-5 grams. >>>> >>>> Be aware, this will produce a rather large index. >>>> >>>> >>>> karl >>>> >>>> 13 feb 2009 kl. 10.43 skrev d-fader: >>>> >>>> Karl, >>>> >>>> >>>>> As a matter of fact I more or less did. I'm not really into NGrams, but >>>>> I >>>>> read some articles about this technique and I eventually ended up at >>>>> the >>>>> 'Did you mean: Lucene?' article written by Tom White. To make a long >>>>> story >>>>> short, this solved my problem partially. I do have 2 indexes now and >>>>> I've >>>>> written code to extract all terms a user entered, put them through the >>>>> suggestion engine and tries to be clever about what suggestion should >>>>> be >>>>> used. It includes that stop words are ignored, when the entered term >>>>> exists >>>>> for more than x times in the index already it's probably good (and thus >>>>> a >>>>> suggestion is not needed) and when there are suggestions available, the >>>>> suggestion with the most occurences in the index is presented. After >>>>> that >>>>> the original query is being built up again, preserving all command >>>>> codes >>>>> (like ", ( ), AND, OR, etc. etc.). >>>>> As said, this system works pretty well and mostly if there's a >>>>> suggestion >>>>> available, it's actually quite accurate, so thanks for this. >>>>> >>>>> Still, it doesn't solve my problem fully. But I think I now know why >>>>> Lucene can't search 'truely' partially. To find a document fast, all >>>>> terms >>>>> are stored with a list of documents which contain the term and when a >>>>> user >>>>> searches, Lucene can identify the documents by comparing the terms >>>>> entered >>>>> to the terms on that list, right? If so, it's understandable that a >>>>> true >>>>> partial search never will work, but then I just don't understand how >>>>> Google >>>>> manages to do this :) >>>>> >>>>> Jori. >>>>> >>>>> >>>>> >>>>> >>>>> Karl Wettin wrote: >>>>> >>>>> >>>>> >>>>>> Hi again Jori, >>>>>> >>>>>> did you try N-grams as suggested in the reply on -dev? >>>>>> >>>>>> >>>>>> karl >>>>>> >>>>>> 13 feb 2009 kl. 09.05 skrev d-fader: >>>>>> >>>>>> Hi, >>>>>> >>>>>> >>>>>>> I've actually posted this message in de dev mailing list earlier, >>>>>>> because I though my 'issue' is a limitation of the functionality of >>>>>>> Lucene, but they redirected me to this mailinglist, so I hope one of >>>>>>> you >>>>>>> guys can help me out :) >>>>>>> >>>>>>> Maybe the 'issue' I'm addressing now is discussed thouroughly >>>>>>> already, >>>>>>> in that case I think I need some redirection to the sources of those >>>>>>> discussions :) Anyway, here's the thing. >>>>>>> For all I know it's impossible to search partial words with Lucene >>>>>>> (except the asterix method with e.g. the StandardAnalyzer -> ambul* >>>>>>> to >>>>>>> find ambulance). My problem with that method is that my index >>>>>>> consists >>>>>>> of quite a few terms. This means that if a user would search for >>>>>>> 'ambu >>>>>>> amster' (ambulance amsterdam), there will be so many terms to search, >>>>>>> the waiting time is just inacceptable. Now I started thinking why >>>>>>> it's >>>>>>> impossible to search only a 'part' of a term or even only the 'start' >>>>>>> of >>>>>>> a term and the only reason I could think of was that the Index terms >>>>>>> are >>>>>>> stored tokenized (in that way you (of course) can't find partial >>>>>>> terms, >>>>>>> since the index doesn't actually contain the literal terms, but >>>>>>> tokens >>>>>>> instead). But Lucene can also store all terms untokenized, so in that >>>>>>> case, in my humble opinion, a partial search would be possible, since >>>>>>> all terms would be stored 'literally'. >>>>>>> >>>>>>> Maybe my thinking is wrong, I only have a black box view of Lucene, >>>>>>> so >>>>>>> I >>>>>>> don't know much about indexing algorithm and all, but I just want to >>>>>>> know if this could be done or else why not :) You see, the users of >>>>>>> my >>>>>>> index want to know why they can't search parts of the words they >>>>>>> enter >>>>>>> and I still can't give them a really good answer, except the 'it >>>>>>> would >>>>>>> result in too many OR operators in the query' statement :) . I've >>>>>>> tried >>>>>>> using a Dutch stemmer (most of the data I'm indexing is Dutch) but >>>>>>> that >>>>>>> didn't work out quite good. Furthermore users sometimes search for a >>>>>>> certain 'filename' and mostly they just enter a part of the name and >>>>>>> thus don't find anything. >>>>>>> >>>>>>> I hope someone can enlighten me :) Thanks in advance! >>>>>>> >>>>>>> Jori >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>>> >>>>>> >>>>>> >>>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>> >>>>> >>>>> >>>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>>> >>>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >>> >>> >> >> >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com