[ https://issues.apache.org/jira/browse/LUCENE-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668466#action_12668466 ]
Michael McCandless commented on LUCENE-1487: -------------------------------------------- How about this: {code} /** * A {...@link Filter} that only accepts documents whose single * term value in the specified field is contained in the * provided set of allowed terms. * * <p/> * * This is the same functionality as TermsFilter (from * contrib/queries), except this filter requires that the * field contains only a single term for all documents. * Because of drastically different implementations, they * also have different performance characteristics, as * described below. * * <p/> * * The first invocation of this filter on a given field will * be slower, since a {...@link FieldCache.StringIndex} must be * created. Subsequent invocations using the same field * will re-use this cache. However, as with all * functionality based on {...@link FieldCache}, persistent RAM * is consumed to hold the cache, and is not freed until the * {...@link IndexReader} is closed. In contrast, TermsFilter * has no persistent RAM consumption. * * * <p/> * * With each search, this filter translates the specified * set of Terms into a private {...@link OpenBitSet} keyed by * term number per unique {...@link IndexReader} (normally one * reader per segment). Then, during matching, the term * number for each docID is retrieved from the cache and * then checked for inclusion using the {...@link OpenBitSet}. * Since all testing is done using RAM resident data * structures, performance should be very fast, most likely * fast enough to not require further caching of the * DocIdSet for each possible combination of terms. * However, because docIDs are simply scanned linearly, an * index with a great many small documents may find this * linear scan too costly. * * <p/> * * In contrast, TermsFilter builds up an {...@link OpenBitSet}, * keyed by docID, every time it's created, by enumerating * through all matching docs using {...@link TermDocs} to seek * and scan through each term's docID list. While there is * no linear scan of all docIDs, besides the allocation of * the underlying array in the {...@link OpenBitSet}, this * approach requires a number of "disk seeks" in proportion * to the number of terms, which can be exceptionally costly * when there are cache misses in the OS's IO cache. * * <p/> * * Generally, this filter will be slower on the first * invocation for a given field, but subsequent invocations, * even if you change the allowed set of Terms, should be * faster than TermsFilter, especially as the number of * Terms being matched increases. If you are matching only * a very small number of terms, and those terms in turn * match a very small number of documents, TermsFilter may * perform faster. * * <p/> * * Which filter is best is very application dependent. */ {code} > FieldCacheTermsFilter > --------------------- > > Key: LUCENE-1487 > URL: https://issues.apache.org/jira/browse/LUCENE-1487 > Project: Lucene - Java > Issue Type: New Feature > Components: Search > Affects Versions: 2.4 > Reporter: Tim Sturge > Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: FieldCacheTermsFilter.java, FieldCacheTermsFilter.java, > LUCENE-1487.patch > > > This is a companion to FieldCacheRangeFilter except it operates on a set of > terms rather than a range. It works best when the set is comparatively large > or the terms are comparatively common. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org