On 11 Apr 2007 at 18:05, Erick Erickson wrote:
> Rather than using a search, have you thought about using a TermEnum?
> It's much, much, much faster than a query. What it allows you to do
> is enumerate the terms in the index on a per-field basis. Essentially, this
> is what happens when you do a PrefixQuery as BooleanClauses are
> added, but you have very few options for restricting the returned list when
> you use PrefixQuery.
> 
As I'm still new to Lucene I have not looked into TermEnum yet.
And yes, you are right: I was already wondering how to cut down on 
the results returned by a PrefixQuery.

...
> What I have in mind is something like returning the first N terms
> that match a particular prefix pattern. Even if you elect not to do this,
> and return all the possibilities, this will be much faster than
> executing a query. And won't run afoul of the TooManyClauses
> exception, you'll only be restricted by available memory. Not to
> mention simplifying your index over the bigram/trigram option <G>.....
> 
If I understand correctly, you are suggesting that I look up the 
documents matching the prefixes separately via TermDocs.seek(enum), 
possibly restrict them by evaluating document boosts etc., and then 
merge what remains with the separate search results for the other 
tokens. Is that right?
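
Just to make sure I picture it right, here is a minimal sketch of 
what I think you mean, against the 2.x API; the field name "title" 
and the limit of 20 terms are made up:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    public class PrefixLookup {
        /** Walk the terms of one field that start with the given prefix,
         *  taking at most maxTerms of them, and seek into the postings of
         *  each matching term instead of building a huge BooleanQuery. */
        public static void lookup(IndexReader reader, String field,
                                  String prefix, int maxTerms) throws IOException {
            TermEnum terms = reader.terms(new Term(field, prefix));
            TermDocs docs = reader.termDocs();
            try {
                int collected = 0;
                do {
                    Term t = terms.term();
                    // Terms are ordered, so stop as soon as the field or
                    // the prefix no longer matches.
                    if (t == null || !t.field().equals(field)
                            || !t.text().startsWith(prefix)) {
                        break;
                    }
                    docs.seek(terms);
                    while (docs.next()) {
                        int docId = docs.doc();
                        // ... restrict by boost / merge with the results
                        // for the other tokens here
                    }
                } while (terms.next() && ++collected < maxTerms);
            } finally {
                terms.close();
                docs.close();
            }
        }
    }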

> BTW, you can alter the limit for returning the TooManyClauses option
> by BooleanQuery.setMaxClauseCount, but I'd really recommend the
> WildCardTermEnum approach first.
Yes, that was the point where I thought that turning to the group 
would probably get me some better ideas ;-)
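
For reference, the limit is just a static setter (1024 is the 
default; the 4096 below is an arbitrary value):

    // Raise the clause limit before running the expanded query;
    // note this affects all BooleanQueries in the JVM.
    BooleanQuery.setMaxClauseCount(4096);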

> 
> Finally, your question about copying an index... it may not be easy.
> Particularly if you have terms that are indexed but not stored, you
> won't be able to reconstruct your documents exactly from the index....
Antony Bowesman suggested the PerFieldAnalyzerWrapper, which would 
have removed the need to copy the index.
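
For the archives, the wrapper is only a few lines; the field name, 
the index path and the choice of analyzers below are only my 
examples:

    // The default analyzer handles most fields, while the field used
    // for completion stays un-tokenized.
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("artist", new KeywordAnalyzer());
    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);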

> 
> Best
> Erick
> 

Do you also have an idea for how to improve a fault-tolerant search 
for the completed terms?
The shortcomings are somewhat similar.
Running each token through a spell checker and adding the results to 
a BooleanQuery does not help with the performance.
Besides, with Lucene's standard spell checker I think there is no 
way to influence the sorting of the suggestions (because there are 
no criteria to sort by). So the restriction to the first 4-10 
suggestions is essentially arbitrary and might just miss the most 
appropriate one.
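
To illustrate the pattern I mean, a minimal sketch against the 
contrib SpellChecker; the spell index path, the "title" field and 
the cut-off of 5 are placeholders of mine:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.spell.LuceneDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SuggestSketch {
        public static BooleanQuery suggest(IndexReader reader, String token)
                throws Exception {
            // Build the spelling index from an existing field (done once,
            // not per request).
            Directory spellDir = FSDirectory.getDirectory("/tmp/spellindex", true);
            SpellChecker spell = new SpellChecker(spellDir);
            spell.indexDictionary(new LuceneDictionary(reader, "title"));

            // Take only the first few suggestions -- this is the
            // essentially arbitrary cut-off I mean.
            String[] suggestions = spell.suggestSimilar(token, 5);

            BooleanQuery query = new BooleanQuery();
            for (int i = 0; i < suggestions.length; i++) {
                query.add(new TermQuery(new Term("title", suggestions[i])),
                          BooleanClause.Occur.SHOULD);
            }
            return query;
        }
    }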

I've tried the NGramTokenizer from the Action book (contributed by 
alias-i, now apparently part of the LingPipe distribution) and it 
gives better results in that it returns suggestions based on the 
weight of the documents, but at a much higher cost in disk space as 
well as in memory and performance.

BTW, my test data is ~1.5 million artist / song titles which I 
extracted from a CDDB dump.
This data represents the typical applications I have in mind quite 
well: lots of tiny documents with 2-3 indexed fields that allow for 
faceted search, possibly each associated with some metadata.
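
Concretely, each entry ends up roughly like this; the field names 
and flags are only how I would sketch it:

    // One tiny document per artist/title pair; one or two extra fields
    // (genre, year, ...) would carry the facets.
    Document doc = new Document();
    doc.add(new Field("artist", artist, Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("title",  title,  Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("genre",  genre,  Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);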

Ideally the system should scale well under heavy user load. That is 
certainly not a simple task when every keystroke translates into a 
query for suggestions, but the existing implementations show that it 
can be done. I just start to wonder whether those are done with 
Lucene and written in Java. :-/

I presume that the need for scalability also rules out any sort of 
result caching with the Lucene filter wrappers. Even a bitmap for 
millions of documents adds up to something substantial.
Optimizing the retrieval itself is probably worth more than what a 
caching strategy can add on top of its own overhead.
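
To put a rough number on it, assuming the 2.x QueryFilter / 
CachingWrapperFilter classes and a hypothetical "title" prefix 
filter:

    // Back-of-envelope: a cached filter is one bit per document, so
    // ~1,500,000 docs / 8 = ~187 KB per distinct cached prefix; a few
    // thousand cached entries already reach hundreds of MB.
    Filter filter = new CachingWrapperFilter(
            new QueryFilter(new PrefixQuery(new Term("title", prefix))));
    Hits hits = searcher.search(query, filter);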

More thoughts, anyone?

Thank You.

Cheers, Steffen




