[ https://issues.apache.org/jira/browse/SOLR-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833495#action_12833495 ]
Koji Sekiguchi edited comment on SOLR-1773 at 2/14/10 4:51 AM: --------------------------------------------------------------- Random comment on the patch: - TimeAllowed not supported - cache not supported - distributed search is not supported - sort field is hard-coded in the patch - collapse.type=adjacent is not supported - collapse.aggregate is not supported (but supportable) - not yet, but collapse.sort can be supported to specify sort criteria in collapse group supported parameters: |collapse|set to on to use field collapsing| |collapse.field|field name to collapse (required)| |collapse.limit|maximum number of collapsed docs to return in each collapse group| |collapse.fl|comma- or space- delimited list of fields to return| was (Author: koji): Random comment on the patch: - TimeAllowed not supported - cache not supported - distributed search is not supported - sort field is hard-coded in the patch - collapse.type=adjacent is not supported - collapse.aggregate is not supported (but supportable) - not yet, but collapse.sort can be supported supported parameters: |collapse|set to on to use field collapsing| |collapse.field|field name to collapse (required)| |collapse.limit|maximum number of collapsed docs to return in each collapse group| |collapse.fl|comma- or space- delimited list of fields to return| > Field Collapsing (lightweight version) > -------------------------------------- > > Key: SOLR-1773 > URL: https://issues.apache.org/jira/browse/SOLR-1773 > Project: Solr > Issue Type: New Feature > Components: search > Affects Versions: 1.4 > Reporter: Koji Sekiguchi > Priority: Minor > Attachments: LOADTEST.patch, SOLR-1773.patch > > > I'd like to start another approach for field collapsing suggested by Yonik on > 19/Dec/09 at SOLR-236. Re-posting the idea: > {code} > =================== two pass collapsing algorithm for collapse.aggregate=max > ==================== > First pass: pretend that collapseCount=1 > - Use a TreeSet as a priority queue since one can remove and insert > entries. > - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to > top entry in the TreeSet > - compare new doc with smallest element in treeset. If smaller discard and > go to the next doc. > - If new doc is bigger, look up it's group. Use the Map to find if the > group has been added to the TreeSet and add it if not. > - If the new bigger doc is already in the TreeSet, compare with the > document in that group. If bigger, update the node, > remove and re-add to the TreeSet to re-sort. > efficiency: the treeset and hashmap are both only the size of the top number > of docs we are looking at (10 for instance) > We will now have the top 10 documents collapsed by the right field with a > collapseCount of 1. Put another way, we have the top 10 groups. > Second pass (if collapseCount>1): > - create a priority queue for each group (10) of size collapseCount > - re-execute the query (or if the sort within the collapse groups does not > involve score, we could just use the docids gathered during phase 1) > - for each document, find it's appropriate priority queue and insert > - optimization: we can use the previous info from phase1 to even avoid > creating a priority queue if no other items matched. > So instead of creating collapse groups for every group in the set (as is done > now?), we create it for only 10 groups. > Instead of collecting the score for every document in the set (40MB per > request for a 10M doc index is *big*) we re-execute the query if needed. > We could optionally store the score as is done now... but I bet aggregate > throughput on large indexes would be better by just re-executing. > Other thought: we could also cache the first phase in the query cache which > would allow one to quickly move to the 2nd phase for any collapseCount. > {code} > The restriction is: > {quote} > one would not be able to tell the total number of collapsed docs, or the > total number of hits (or the DocSet) after collapsing. So only > collapse.facet=before would be supported. > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.