On 11 Apr 2007 at 18:05, Erick Erickson wrote:

> Rather than using a search, have you thought about using a TermEnum?
> It's much, much, much faster than a query. What it allows you to do
> is enumerate the terms in the index on a per-field basis. Essentially, this
> is what happens when you do a PrefixQuery as BooleanClauses are
> added, but you have very few options for restricting the returned list when
> you use PrefixQuery.

As I'm still fresh with Lucene, I haven't looked into TermEnum yet. And yes,
you are right: I had already wondered how to cut down on the results of a
prefix query.
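I haven't tried it yet, but from the javadocs I imagine the basic loop would
look roughly like the sketch below. The field name, the cut-off of 10 and the
class name are just placeholders for my data, so please correct me if I'm
misreading the API:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class PrefixSuggester {

        // collect at most maxHits terms of the given field that start with prefix
        public static List suggest(IndexReader reader, String field, String prefix,
                                   int maxHits) throws IOException {
            List hits = new ArrayList();
            // positions the enumeration at the first term >= (field, prefix)
            TermEnum terms = reader.terms(new Term(field, prefix));
            try {
                do {
                    Term t = terms.term();
                    // stop once we run out of terms, leave the field or the prefix
                    if (t == null || !t.field().equals(field)
                            || !t.text().startsWith(prefix))
                        break;
                    hits.add(t.text());   // terms.docFreq() could be used for ranking
                } while (terms.next() && hits.size() < maxHits);
            } finally {
                terms.close();
            }
            return hits;
        }
    }

E.g. suggest(reader, "title", "beat", 10) should give me the first ten index
terms of the "title" field starting with "beat".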
...

> What I have in mind is something like returning the first N terms
> that match a particular prefix pattern. Even if you elect not to do this,
> and return all the possibilities, this will be much faster than
> executing a query. And won't run afoul of the TooManyClauses
> exception, you'll only be restricted by available memory. Not to
> mention simplifying your index over the bigram/trigram option <G>.....

If I understand correctly, you are suggesting looking up the documents that
match the prefixes with TermDocs.seek(enum) separately, possibly restricting
them by evaluating doc boosts etc., and then merging what remains with the
separate search results for the other tokens. Is that right? (I've put a
little sketch of what I pictured below my sig.)

> BTW, you can alter the limit for returning the TooManyClauses option
> by BooleanQuery.setMaxClauseCount, but I'd really recommend the
> WildCardTermEnum approach first.

Yes, that was the point where I thought that turning to the group would
probably get me some better ideas ;-)

> Finally, your question about copying an index... it may not be easy.
> Particularly if you have terms that are indexed but not stored, you
> won't be able to reconstruct your documents exactly from the index....

Antony Bowesman suggested the PerFieldAnalyzerWrapper, which would have made
copying the index unnecessary in the first place.

> Best
> Erick

Do you also have an idea how to improve a fault-tolerant search for the
completed terms? The shortcomings are somewhat similar. Running each term
through a spell checker and adding the results to a BooleanQuery does not
help performance. Besides, with Lucene's standard spell checker I think there
is no way to influence the sorting of the suggestions (because there are no
criteria to sort by), so restricting the list to the first 4-10 suggestions
is essentially random and might just miss the most appropriate one. I've
tried the NGramTokenizer from the "Lucene in Action" book (contributed by
alias-i, now apparently part of the LingPipe distribution), and it gives
better results in that it returns suggestions based on the weight of the
documents, but at a much bigger cost in disk space as well as memory and
performance.

BTW, my test data is ~1.5 million artist/song titles which I extracted from a
CDDB dump. This data represents the typical applications I have in mind very
well: lots of tiny documents with 2-3 indexed fields that allow for faceted
search, possibly with some metadata attached to each. Ideally the system
should scale well under heavy user load - certainly not a simple task when
every keystroke translates into a query for suggestions, but existing
implementations show that it can be done. I'm just starting to wonder whether
those are actually done with Lucene and written in Java. :-/

I presume that the need for scalability also rules out any sort of result
caching with the Lucene filter wrappers. Even a bitmap for millions of
documents must add up to something substantial. Optimizing the retrieval
itself is probably worth more than what a caching strategy could bring, given
its additional overhead.

More thoughts, anyone?

Thank you.

Cheers,
Steffen
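P.S. So that I'm sure I read the TermDocs part correctly: the sketch below is
roughly what I pictured (untested, same placeholder names as before, and the
boost-based restriction only hinted at in a comment).

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    public class PrefixMatches {

        // mark every document containing a term of 'field' that starts with 'prefix'
        public static BitSet docsMatchingPrefix(IndexReader reader, String field,
                                                String prefix) throws IOException {
            BitSet matches = new BitSet(reader.maxDoc());
            TermEnum terms = reader.terms(new Term(field, prefix));
            TermDocs docs = reader.termDocs();
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field)
                            || !t.text().startsWith(prefix))
                        break;
                    docs.seek(terms);              // all docs containing this term
                    while (docs.next())
                        matches.set(docs.doc());   // restricting by doc boosts would go here
                } while (terms.next());
            } finally {
                docs.close();
                terms.close();
            }
            // this BitSet would then be intersected with the hits for the other tokens
            return matches;
        }
    }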

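P.P.S. I did put some rough numbers to the filter bitmaps before sending: a
cached BitSet costs one bit per document, so for my ~1.5 million titles that
is 1,500,000 / 8 = 187,500 bytes, roughly 183 KB per cached filter. One
filter on its own is harmless, but caching one per distinct prefix that users
type would add up quickly.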