Varun Thacker created SOLR-9978:
-----------------------------------

             Summary: Reduce collapse query memory usage
                 Key: SOLR-9978
                 URL: https://issues.apache.org/jira/browse/SOLR-9978
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Varun Thacker
            Assignee: Varun Thacker


- Single shard test with one replica 
- 10M documents, 9M of which have a unique collapse value. The test collapsed on a string field
- The collapse query parser creates two arrays:
  - an int array for the unique documents (9M in this case)
  - a float array for the corresponding scores (9M in this case)
- It goes through all documents and puts a document in the array if its score 
is better than the previously stored score.
- So collapse creates a lot of garbage when the total number of documents is 
high and there are very few duplicates
- Even for a query like this {{q={!cache=false}*:*&fq={!collapse 
field=collapseField_s cache=false}&sort=id desc}}
  which has a top-level sort, the collapse query parser still creates the score 
array and scores every document
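
The score-based collapse described in the list above can be sketched roughly as follows. This is a simplified illustration and not the actual collapse query parser code: the class {{ScoreCollapseSketch}}, its method names, and the precomputed {{groupOrd}} and {{score}} inputs are all hypothetical and exist only to show why two arrays sized by the number of unique values are allocated.

```java
import java.util.Arrays;

// Hypothetical sketch: keep the best-scoring document per collapse value.
class ScoreCollapseSketch {

    // groupOrd[doc] is an assumed precomputed ordinal of the doc's collapse value.
    // Returns, per ordinal, the doc id with the highest score (-1 if no doc has it).
    static int[] collapseByScore(int[] groupOrd, float[] score, int numGroups) {
        int[] bestDoc = new int[numGroups];       // one int per unique value
        float[] bestScore = new float[numGroups]; // one float per unique value
        Arrays.fill(bestDoc, -1);
        for (int doc = 0; doc < groupOrd.length; doc++) {
            int ord = groupOrd[doc];
            // Replace the stored doc only if this one scores strictly better.
            if (bestDoc[ord] == -1 || score[doc] > bestScore[ord]) {
                bestDoc[ord] = doc;
                bestScore[ord] = score[doc];
            }
        }
        return bestDoc;
    }

    // Payload size of the two arrays, ignoring JVM array headers.
    static long arrayBytes(int numGroups) {
        return (long) numGroups * (Integer.BYTES + Float.BYTES);
    }

    public static void main(String[] args) {
        int[] groupOrd = {0, 0, 1, 2, 2};
        float[] score = {1.0f, 2.0f, 0.5f, 3.0f, 3.0f};
        System.out.println(Arrays.toString(collapseByScore(groupOrd, score, 3)));
        System.out.println(arrayBytes(9_000_000) + " bytes for 9M unique values");
    }
}
```

With 9M unique values, the two arrays hold 9,000,000 * (4 + 4) = 72,000,000 bytes of payload, and they are allocated even when the scores are never used for sorting.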


Indexing script used to generate dummy data:
{code}
    // Index 10M documents; every 10th document duplicates the collapse value of the one before it.
    List<SolrInputDocument> docs = new ArrayList<>(1000);
    for(int i=0; i<1000*1000*10; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", i);
      if (i%10 ==0 && i!=0) {
        doc.addField("collapseField_s", i-1);
      } else {
        doc.addField("collapseField_s", i);
      }
      docs.add(doc);
      if (docs.size() == 1000) {
        client.add("ct", docs);
        docs.clear();
      }
    }
    client.commit("ct");
{code}

Query:
{{q={!cache=false}*:*&fq={!collapse field=collapseField_s cache=false}&sort=id 
desc}}

Improvements:
- We currently default to the SCORE implementation if no min|max|sort param is 
provided in the collapse query. Instead, check whether a global sort is provided and, 
if so, skip scoring and pick the first occurrence of each unique value.
- Instead of creating an int array for the unique documents, use a bitset
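
A minimal sketch of the two proposed improvements combined, assuming each document's collapse value has already been mapped to an ordinal. The class {{FirstOccurrenceCollapseSketch}} and its inputs are hypothetical, and "first occurrence" here means first in iteration order; this only illustrates the idea and is not a proposed patch.

```java
import java.util.BitSet;

// Hypothetical sketch: collapse without scoring, tracking seen values in a bitset.
class FirstOccurrenceCollapseSketch {

    // groupOrd[doc] is an assumed precomputed ordinal of the doc's collapse value.
    // Returns a bitset of the doc ids kept after collapsing.
    static BitSet collapseFirstOccurrence(int[] groupOrd, int numGroups) {
        BitSet seenGroups = new BitSet(numGroups);     // one bit per unique value
        BitSet keptDocs = new BitSet(groupOrd.length); // one bit per document
        for (int doc = 0; doc < groupOrd.length; doc++) {
            int ord = groupOrd[doc];
            if (!seenGroups.get(ord)) {
                // First document seen for this value: keep it, no score needed.
                seenGroups.set(ord);
                keptDocs.set(doc);
            }
        }
        return keptDocs;
    }

    public static void main(String[] args) {
        int[] groupOrd = {0, 0, 1, 2, 2};
        System.out.println(collapseFirstOccurrence(groupOrd, 3));
    }
}
```

A bitset needs one bit per entry, so tracking 9M unique values takes roughly 1.1 MB, compared with 36 MB for a 9M-entry int array, and the float score array is not needed at all.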





