[ https://issues.apache.org/jira/browse/SOLR-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859846#comment-15859846 ]
Varun Thacker commented on SOLR-9978:
-------------------------------------

Quick benchmark with the latest patch.

10 million documents were indexed, with 1 in 10 documents duplicating another document's collapse value. 50 queries were run on this index and the freed memory in gc viewer was recorded.

||Query||Freed Memory||
|\{!collapse field=collapseField_s cache=false\}&sort=id desc|848 MB|
|\{!collapse field=collapseField_s cache=false\}|4385 MB|
|\{!collapse field=collapseField_ti cache=false\}&sort=id desc|5062 MB|
|\{!collapse field=collapseField_ti cache=false\}|9408 MB|

> Reduce collapse query memory usage
> ----------------------------------
>
>                 Key: SOLR-9978
>                 URL: https://issues.apache.org/jira/browse/SOLR-9978
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>         Attachments: SOLR-9978.patch, SOLR-9978.patch
>
>
> - Single shard test with one replica
> - 10M documents and 9M of those documents are unique. Test was for string
> - Collapse query parser creates two arrays:
>   - int array for unique documents (9M in this case)
>   - float array for the corresponding scores (9M in this case)
> - It goes through all documents and puts a document in the array if its
> score is better than the previously existing score.
> - So collapse creates a lot of garbage when the total number of documents is
> high and there are very few duplicates
> - Even for a query like this {{q={!cache=false}*:*&fq={!collapse
> field=collapseField_s cache=false}&sort=id desc}}
> which has a top-level sort, the collapse query parser creates the score
> array and scores every document
> Indexing script used to generate dummy data:
> {code}
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.solr.common.SolrInputDocument;
>
> // Index 10M documents, with every 10th document a duplicate.
> // `client` is assumed to be an existing SolrClient; "ct" is the collection.
> List<SolrInputDocument> docs = new ArrayList<>(1000);
> for (int i = 0; i < 1000 * 1000 * 10; i++) {
>   SolrInputDocument doc = new SolrInputDocument();
>   doc.addField("id", i);
>   if (i % 10 == 0 && i != 0) {
>     // Reuse the previous document's value so the two collapse together.
>     doc.addField("collapseField_s", i - 1);
>   } else {
>     doc.addField("collapseField_s", i);
>   }
>   docs.add(doc);
>   if (docs.size() == 1000) {
>     // Send documents in batches of 1000.
>     client.add("ct", docs);
>     docs.clear();
>   }
> }
> client.commit("ct");
> {code}
> Query:
> {{q=\{!cache=false\}*:*&fq=\{!collapse field=collapseField_s
> cache=false\}&sort=id desc}}
> Improvements
> - We currently default to the SCORE implementation if no min|max|sort param
> is provided in the collapse query. Instead, check whether a global sort is
> provided and, if so, skip scoring and pick the first occurrence of each
> unique value.
> - Instead of creating an array for unique documents, use a bitset
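
The two improvements listed in the issue (skip scoring when a global sort is present, and track unique documents with a bitset rather than an int array) can be illustrated with a small standalone sketch. This is not the code from SOLR-9978.patch and it uses no Solr classes: `CollapseSketch`, `collapseFirstOccurrence`, and the `ordinalByDoc` array (one collapse-value ordinal per document id) are hypothetical names introduced only for illustration.

```java
import java.util.BitSet;

public class CollapseSketch {

    /**
     * Returns a bitset over document ids in which a bit is set for the first
     * document seen with each collapse ordinal. No scores are computed or stored.
     *
     * ordinalByDoc[doc] is assumed to hold the collapse value of `doc`,
     * encoded as an int in the range [0, numOrdinals).
     */
    public static BitSet collapseFirstOccurrence(int[] ordinalByDoc, int numOrdinals) {
        BitSet seenOrdinals = new BitSet(numOrdinals);      // one bit per unique value
        BitSet keptDocs = new BitSet(ordinalByDoc.length);  // one bit per document
        for (int doc = 0; doc < ordinalByDoc.length; doc++) {
            int ord = ordinalByDoc[doc];
            if (!seenOrdinals.get(ord)) {
                // First document with this collapse value: keep it.
                seenOrdinals.set(ord);
                keptDocs.set(doc);
            }
        }
        return keptDocs;
    }

    public static void main(String[] args) {
        // Same shape as the indexing script, scaled down to 100 documents:
        // every 10th document reuses the previous document's collapse value.
        int numDocs = 100;
        int[] ordinalByDoc = new int[numDocs];
        for (int i = 0; i < numDocs; i++) {
            ordinalByDoc[i] = (i % 10 == 0 && i != 0) ? i - 1 : i;
        }
        BitSet kept = collapseFirstOccurrence(ordinalByDoc, numDocs);
        System.out.println("kept " + kept.cardinality() + " of " + numDocs + " documents");
    }
}
```

As rough arithmetic, assuming 4-byte int and float elements and ignoring object overhead: the two 9M-element arrays described in the issue come to about 36 MB each, or roughly 72 MB per query, whereas a bitset with one bit per document over 10M documents is about 1.25 MB. The sketch above uses two such bitsets, so about 2.5 MB at that scale.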