On Tue, Jul 17, 2012 at 12:44 PM, Roman Chyla <[email protected]> wrote:
> Hi,
>
> Tests show that TermEnum.docFreq() returns sum of all docs, including
> the deleted ones. Which seems to (indirectly) contradict the javadoc
That's right: docFreq counts deleted documents too, and fixing it to
exclude them would be prohibitively costly.

Hmm, which version/javadocs are you looking at? IndexReader.docFreq at
least calls out this limitation.
> This frequency count is used to compute uninverted index
> (DocTermOrds.uninvert()). The code goes like:
>
> final int df = te.docFreq();
> if (df <= maxTermDocFreq) {
>
>
> So, if I happen to have many deleted documents, and maxTermDocFreq is
> low, then the term will be excluded (even if the freq of the livedocs
> is OK). Most likely, the cache will be incomplete.
>
> Can it be considered a feature? Or is it a bug?
Maybe we could pro-rate the returned docFreq by the percentage of
deleted documents? It wouldn't be perfectly correct, but on average it
should have the right effect (keeping RAM consumption down).
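Roughly what I have in mind, as an untested sketch rather than a patch. The class and method names here are made up for illustration; the inputs would come from te.docFreq(), IndexReader.maxDoc() and IndexReader.numDocs(). It assumes deletions are spread evenly across terms, which is exactly why the result is only an estimate.

```java
// Hypothetical helper, not part of Lucene: scales a raw docFreq by the
// fraction of documents that are still live.
final class ProRatedDocFreq {

    private ProRatedDocFreq() {}

    /**
     * @param rawDocFreq docFreq as reported by the term enum (includes deleted docs)
     * @param maxDoc     total number of doc slots in the reader, deleted ones included
     * @param numDocs    number of live (non-deleted) documents
     * @return estimated docFreq over live documents only
     */
    static int proRate(int rawDocFreq, int maxDoc, int numDocs) {
        // Empty reader or no deletions: nothing to scale.
        if (maxDoc <= 0 || numDocs >= maxDoc) {
            return rawDocFreq;
        }
        // Assumption: the term loses documents at the same rate as the index as a whole.
        double liveFraction = (double) numDocs / maxDoc;
        return (int) Math.round(rawDocFreq * liveFraction);
    }
}
```

The check in uninvert() would then compare the estimate, instead of the raw docFreq, against maxTermDocFreq.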
Can you open a Jira issue? Thanks.
Mike McCandless
http://blog.mikemccandless.com