Thank you, Erick, this is very useful! Have you ever taken a look at Google Suggest[1]? It's very fast, and the results are impressive. I think your suggestion will go a long way to fixing my problem, but there's probably still quite a gap between this approach and the kind of results that Google Suggest provides.
I wonder how it could be possible to do the same with Lucene... Anyway, thanks a lot for the help! [1] http://www.google.com/webhp?complete=1&hl=en On Tue, 2007-05-15 at 12:46 -0400, Erick Erickson wrote: > OK, I'm going to go on the assumption that all you're interested in > is auto completion, so don't try to generalized this to queries...... > > Don't use queries, PrefixQuery or otherwise. Use one of the > TermEnums, probably WildcardTermEnum. What that will do is > allow you to find all the terms, in lexical order, that match > your fragment without using queries at all. This has several > advantages.... > > 1> it's fast. It doesn't require you to do anything but march down > some index list. > 2> it doesn't expand the terms prior to processing. No such thing > as "TooManyClauses". Perhaps OutOfMemory if you try to > return 100,000,000 terms <G>.... > 3> you can stop whenever you've accumulated "enough" terms, > where "enough" is up to you. > > NOTE: there's also a RegexTermEnum in the > contrib section (last I knew, it may be in core in 2.1) that > allows arbitrary regex enumerations, but it's significantly > slower than WildcardTermEnum, which is hardly surprising > since it has to do more work....... > > It's reasonable to ask how much use auto-completion is when the > poor user has, say, 10,000 terms to choose from, so I think it's > entirely reasonable to get the first, say, 100 terms and quit. You > should be able to do something like this quite easily with the Enums. > > I think your original solution of using queries would not be > satisfactory for the user anyway, *assuming* that the user is > as impatient as I am and wants some auto-complete options > RIGHT NOW <G>, even if you solved the TooManyClauses > issue. > > Along the same lines, another question is whether you should > try to auto-complete when the user has typed less than, say, > 3 characters, but that's your design decision..... > > Really, try the WildcardTermEnum. It's pretty neat. > > Hope this helps > Erick > > On 5/14/07, David Leangen <[EMAIL PROTECTED]> wrote: > > Thank you very much for this. Some more questions inline... > > > > > > - How can I limit the number of hits? I don't know > in > > advance what the data will be, so it's not > feasible for > > me to use RangeQuery. > > > > > > You can use a TopDocs or a HitCollector object which allows > you > > to process each object as it's hit. But I doubt you need to > do this. > > > No. I expect you're using a wildcard, and wildcard handling > is > > complicated. > > > Ok, you're right. It's not the limiting of the results that's > the > problem, it's the way the search is expanded. > > Since this is an autocomplete, when the user types, for > example "a" or a > Japanese character "あ", I am using PrefixFilter for this, so > I guess > the search turns into "a*" and "あ*" respectively. > > In the archive, the related posts I read either refer to a > DateRange > (where it is possible to search first by year, then month... > etc.), or > they suggest to increase the max count. > > Neither of these solutions work in my case... It's not a date, > and I > have no idea of the results in advance and it would not be > practical or > elegant to speculate on the results (for example first try > aa*~ab* and > see what that gives, etc.). > > I can get access to the "weight" values of the terms (a data > field > determined by their frequency of use), so I'll try something > related to > that. For people with more experience, would that be a good > path to > take? > > Otherwise, would a reasonable solution be to override or > re-implement > PrefixFilter? > > > Thank you so much! > David > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]