[
https://issues.apache.org/jira/browse/SOLR-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Varun Thacker updated SOLR-9978:
--------------------------------
Attachment: SOLR-9978.patch
The patch detects whether a top-level sort is in use. In that case we don't score
documents, and we use a bitset to mark the documents already collected. This way
we avoid allocating the 9M-entry int and float arrays and only need a bitset of 9M bits.
Here is a test over 100 queries, collapsing on a string field:
100 queries               FreedMemory   FreedMemory_ForceGC
With top-level sort       525M          849M
Without top-level sort    3885M         4345M
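To illustrate the idea (this is a sketch, not the actual patch code; names like countUniqueGroups are made up for the example): when a top-level sort is present we don't need per-group scores, so one bit per collapse group can replace the int docId array and float score array. Here java.util.BitSet stands in for Lucene's FixedBitSet.

```java
import java.util.BitSet;

public class CollapseBitSetSketch {

    // Keep only the first occurrence of each group ordinal; later docs in
    // the same group are skipped without ever computing a score.
    public static int countUniqueGroups(int[] groupOrdsOfDocs, int numGroups) {
        // 1 bit per group, versus 8 bytes (int + float) in the array approach.
        // For 9M groups that is roughly 1.1 MB instead of ~72 MB.
        BitSet collected = new BitSet(numGroups);
        int kept = 0;
        for (int ord : groupOrdsOfDocs) {
            if (!collected.get(ord)) {
                collected.set(ord);
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Docs mapped to group ordinals 5, 42, 5, 7, 42: three unique groups survive.
        System.out.println(countUniqueGroups(new int[]{5, 42, 5, 7, 42}, 100));
    }
}
```

This only works when "first occurrence wins" is acceptable, i.e. the collapse itself does not have to pick the best-scoring document per group.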
TODOs:
- Benchmark int performance to see how much reduction we get there. I suspect
it will be ~50% and not more. Strings worked a lot better because we get
ordinals rather than the actual values, which lets us use a bitset; for ints we
need an IntHashSet.
- Check for performance slowdowns. OrdScoreCollector#finish is slower with the
patch when needsScore=false.
- Tests!
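For contrast, a rough sketch of the allocation pattern the patch avoids (field and method names are illustrative, not the actual OrdScoreCollector internals): one int slot and one float slot per unique collapse value, updated whenever a document beats the group's previous best score.

```java
import java.util.Arrays;

public class CollapseArraysSketch {

    // Returns, per group ordinal, the docId with the best score seen so far.
    public static int[] collapseToBest(int[] groupOrdsOfDocs, float[] scoresOfDocs,
                                       int numGroups) {
        // With 9M unique values these two arrays alone are ~72 MB of garbage
        // per request, which is what motivates the bitset approach.
        int[] bestDoc = new int[numGroups];
        float[] bestScore = new float[numGroups];
        Arrays.fill(bestDoc, -1); // -1 means no doc collected for this group yet

        for (int doc = 0; doc < groupOrdsOfDocs.length; doc++) {
            int ord = groupOrdsOfDocs[doc];
            if (bestDoc[ord] == -1 || scoresOfDocs[doc] > bestScore[ord]) {
                bestDoc[ord] = doc;
                bestScore[ord] = scoresOfDocs[doc];
            }
        }
        return bestDoc;
    }

    public static void main(String[] args) {
        int[] ords = {0, 1, 0, 1};
        float[] scores = {1.0f, 2.0f, 3.0f, 0.5f};
        int[] best = collapseToBest(ords, scores, 2);
        // Doc 2 (score 3.0) wins group 0; doc 1 (score 2.0) wins group 1.
        System.out.println(best[0] + " " + best[1]);
    }
}
```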
> Reduce collapse query memory usage
> ----------------------------------
>
> Key: SOLR-9978
> URL: https://issues.apache.org/jira/browse/SOLR-9978
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Varun Thacker
> Assignee: Varun Thacker
> Attachments: SOLR-9978.patch
>
>
> - Single shard test with one replica
> - 10M documents, of which 9M are unique. The test used a string collapse field
> - Collapse query parser creates two arrays :
> - int array for unique documents ( 9M in this case )
> - float array for the corresponding scores ( 9M in this case )
> - It goes through all documents and puts the document in the array if the
> score is better than the previously existing score.
> - So collapse creates a lot of garbage when the total number of documents is
> high and the number of duplicates is low
> - Even for a query like {{q={!cache=false}*:*&fq={!collapse
> field=collapseField_s cache=false}&sort=id desc}},
> which has a top-level sort, the collapse query parser still creates the score
> array and scores every document
> Indexing script used to generate dummy data:
> {code}
> // Index 10M documents, making every 10th document a duplicate of the previous one.
> List<SolrInputDocument> docs = new ArrayList<>(1000);
> for (int i = 0; i < 1000 * 1000 * 10; i++) {
>   SolrInputDocument doc = new SolrInputDocument();
>   doc.addField("id", i);
>   if (i % 10 == 0 && i != 0) {
>     doc.addField("collapseField_s", i - 1); // duplicate the previous doc's value
>   } else {
>     doc.addField("collapseField_s", i);
>   }
>   docs.add(doc);
>   if (docs.size() == 1000) { // flush in batches of 1000
>     client.add("ct", docs);
>     docs.clear();
>   }
> }
> client.commit("ct");
> {code}
> Query:
> {{q=\{!cache=false\}*:*&fq=\{!collapse field=collapseField_s
> cache=false\}&sort=id desc}}
> Improvements
> - We currently default to the SCORE implementation when no min|max|sort param
> is provided in the collapse query. Check whether a global (top-level) sort is
> provided and, if so, don't score documents; pick the first occurrence of each
> unique value instead.
> - Instead of creating an array sized to the number of unique values, use a bitset
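The first improvement above boils down to a small decision: scoring can be skipped only when the collapse localparams carry no min/max/sort and the outer request supplies its own sort. A hypothetical sketch of that check (canSkipScoring and its parameters are invented for illustration, not the parser's actual API):

```java
public class CollapseStrategySketch {

    // Scoring is needed if the collapse itself must pick a doc by min, max,
    // or an inner sort; otherwise a top-level sort lets us take the first
    // occurrence per group and skip scoring entirely.
    public static boolean canSkipScoring(String min, String max,
                                         String innerSort, String topLevelSort) {
        boolean collapseNeedsScore = (min != null) || (max != null) || (innerSort != null);
        return !collapseNeedsScore && topLevelSort != null;
    }

    public static void main(String[] args) {
        // q={!cache=false}*:*&fq={!collapse field=collapseField_s}&sort=id desc
        System.out.println(canSkipScoring(null, null, null, "id desc"));
        // fq={!collapse field=... min=price_f} still needs per-doc values
        System.out.println(canSkipScoring("price_f", null, null, "id desc"));
    }
}
```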
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]