> What is the best way to efficiently convert that list of primary keys to
> Lucene docIds.
Avoid disk seeks. Lucene is fast but still beholden to the laws of physics.
Random disk seeks will cost you, e.g., 50,000 * 5 ms = 250 seconds (minus any
effects of OS disk caching).
The best way to handle this lookup is a PK->docid cache which can be reused for
all users. Since 2.9, Lucene holds caches (e.g. FieldCache) down at the segment
level, so a commit or merge should only invalidate a subset of cached items. The
trouble is, I think FieldCache is for docid->FieldValue lookups, whereas you
want a cache that works the other way around.
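
A rough sketch of how such a per-segment pk->docid map could be populated in
one pass over the term dictionary (assuming the pre-4.0 TermEnum/TermDocs
APIs; the "isbn" field and the method itself are illustrative, not an existing
Lucene facility):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Walks the "isbn" terms of one segment reader and records the single
// docid each primary key maps to.
public static Map<String, Integer> buildPkToDocId(IndexReader segmentReader)
        throws IOException {
    Map<String, Integer> pkToDocId = new HashMap<String, Integer>();
    TermEnum terms = segmentReader.terms(new Term("isbn", ""));
    TermDocs termDocs = segmentReader.termDocs();
    try {
        do {
            Term t = terms.term();
            if (t == null || !"isbn".equals(t.field())) {
                break; // walked past the last "isbn" term
            }
            termDocs.seek(t);
            if (termDocs.next()) {
                // a primary key is unique, so the first doc is the only doc
                pkToDocId.put(t.text(), Integer.valueOf(termDocs.doc()));
            }
        } while (terms.next());
    } finally {
        terms.close();
        termDocs.close();
    }
    return pkToDocId;
}

Because the map is built per segment, a commit or merge only forces a rebuild
for the segments that actually changed.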
Cheers
Mark
>
> I was looking at the Lucene in Action example code (which was not designed
> for this use case) where the Lucene docId is retrieved by iteratively calling
> termDocs.read. How expensive is this operation? Would 50,000 calls return in
> a few seconds or less?
>
> for (String isbn : isbns) {
>     if (isbn != null) {
>         TermDocs termDocs = reader.termDocs(new Term("isbn", isbn));
>         int count = termDocs.read(docs, freqs);
>         if (count == 1) {
>             bits.set(docs[0]);
>         }
>         termDocs.close(); // release the enumeration before the next key
>     }
> }
>
>>> That could involve a lot of disk seeks unless you cache a pk->docid lookup
>>> in ram.
> That sounds interesting. How would the pk->docid lookup get populated?
> Wouldn't a pk->docid cache be invalidated with each commit or merge?
>
> Tom
>
> -----Original Message-----
> From: Mark Harwood [mailto:[email protected]]
> Sent: Friday, July 23, 2010 2:56 AM
> To: [email protected]
> Subject: Re: on-the-fly "filters" from docID lists
>
> Re scalability of filter construction - the database is likely to hold stable
> primary keys, not Lucene doc ids, which are unstable in the face of updates.
> You therefore need a quick way of converting the stable database keys read
> from the db into current Lucene doc ids to create the filter. That could
> involve a lot of disk seeks unless you cache a pk->docid lookup in ram. You
> should use CachingWrapperFilter too, to cache the computed user permissions
> from one search to the next.
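>
> A minimal sketch of that caching, where PermissionFilter stands in for
> whatever Filter computes the allowed docids for a user (the class name is
> hypothetical; CachingWrapperFilter is real):
>
> // Wrap the expensive per-user filter so its DocIdSet is computed once
> // and reused across searches until the underlying reader changes.
> Filter perUserFilter = new PermissionFilter(userId); // hypothetical
> Filter cached = new CachingWrapperFilter(perUserFilter);
> TopDocs hits = searcher.search(query, cached, 10);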
> This can get messy. If the access permissions are centred around roles/groups,
> it is normally faster to tag docs with these group names and query them with
> the list of roles the user holds.
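>
> Sketched out, assuming a "role" field in your schema:
>
> // Index time: tag each doc with the groups allowed to see it.
> doc.add(new Field("role", "admin", Field.Store.NO, Field.Index.NOT_ANALYZED));
>
> // Search time: OR the user's roles together and apply them as a filter.
> BooleanQuery roleQuery = new BooleanQuery();
> for (String role : userRoles) {
>     roleQuery.add(new TermQuery(new Term("role", role)),
>                   BooleanClause.Occur.SHOULD);
> }
> TopDocs hits = searcher.search(userQuery,
>                                new QueryWrapperFilter(roleQuery), 10);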
> If individual user-doc-level perms are required, you could also consider
> dynamically looking up perms for just the top n results being shown, at the
> risk of needing to repeat the query with a larger n if insufficient matches
> pass the lookup.
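>
> One way that retry loop could look, with PermissionChecker standing in for
> your per-doc perms lookup (a hypothetical name, like "wanted" below):
>
> List<Document> visible = new ArrayList<Document>();
> int n = wanted * 2; // initial over-fetch; tune to your typical pass rate
> while (visible.size() < wanted) {
>     TopDocs top = searcher.search(query, n);
>     visible.clear();
>     for (ScoreDoc sd : top.scoreDocs) {
>         Document doc = searcher.doc(sd.doc);
>         if (perms.canSee(userId, doc)) { // the dynamic per-doc lookup
>             visible.add(doc);
>             if (visible.size() == wanted) break;
>         }
>     }
>     if (top.scoreDocs.length < n) break; // index has no more matches
>     n *= 2; // too few passed the lookup - repeat with a larger n
> }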
>
> Cheers
> Mark