[ 
https://issues.apache.org/jira/browse/SOLR-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234092#comment-15234092
 ] 

Yonik Seeley commented on SOLR-8922:
------------------------------------

bq. I thought this was just about saving memory. Why is it faster? Less GC time?

The only reason any difference can be seen at all is that I made docset 
generation a bottleneck in the test.  So the benchmark itself is certainly not 
typical, even if it is real. And yes, I expect that it's all due to GC... but 
it's hard to prove.

bq. > 20% chance of a document missing the value for a field.
bq. Put another way, do you mean any given term has an 80% chance of being in 
the doc?

No, an 80% chance of having a *value* for the field.  The chance of having "any 
given term" would be 80%/nterms.

bq. I'm confused why the number of terms that are in the field has anything to 
do with the performance of this patch.

I'm leveraging pre-existing indexes and tools to test this.  Using the 
different fields with different doc freqs for the terms was an easy way to vary 
the number of docs collected.  100 unique values in the field means a single 
matching term query on that field will match 10M docs * .80 / 100, or 80K docs.
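
To make that arithmetic concrete, here is a rough sketch (not part of the 
benchmark code). The function name and the 10 and 1000 term counts are 
hypothetical, chosen only for illustration, and it assumes values are spread 
evenly across the unique terms:

```python
def expected_matches(num_docs, pct_with_value, unique_terms):
    """Expected hits for a single-term query on one field.

    Assumes each doc has at most one value for the field and that values
    are spread evenly across the unique terms (an illustrative assumption).
    Integer math keeps the result exact.
    """
    return num_docs * pct_with_value // (100 * unique_terms)


if __name__ == "__main__":
    # 10M docs and an 80% chance of having a value, as in the test index.
    # Only the 100-term case is taken from the comment above; the other
    # term counts are hypothetical.
    for nterms in (10, 100, 1000):
        print(nterms, expected_matches(10_000_000, 80, nterms))
```

Fewer unique values in a field means more docs collected per term query, which 
is why switching fields is a cheap way to vary the docset size.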


> DocSetCollector can allocate massive garbage on large indexes
> -------------------------------------------------------------
>
>                 Key: SOLR-8922
>                 URL: https://issues.apache.org/jira/browse/SOLR-8922
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Jeff Wartes
>            Assignee: Yonik Seeley
>         Attachments: SOLR-8922.patch, SOLR-8922.patch
>
>
> After reaching a point of diminishing returns tuning the GC collector, I 
> decided to take a look at where the garbage was coming from. To my surprise, 
> it turned out that for my index and query set, almost 60% of the garbage was 
> coming from this single line:
> https://github.com/apache/lucene-solr/blob/94c04237cce44cac1e40e1b8b6ee6a6addc001a5/solr/core/src/java/org/apache/solr/search/DocSetCollector.java#L49
> This is due to the simple fact that I have 86M documents in my shards. 
> Allocating a scratch array big enough to track a result set 1/64th of my 
> index (1.3M) is also almost certainly excessive, considering my 99.9th 
> percentile hit count is less than 56k.


