pseudo-field-collapsing
-----------------------

                 Key: SOLR-1311
                 URL: https://issues.apache.org/jira/browse/SOLR-1311
             Project: Solr
          Issue Type: New Feature
          Components: search
    Affects Versions: 1.4
            Reporter: Marc Sturlese
             Fix For: 1.5


I am trying to develop a new way of doing field collapsing, based on the 
adjacent field collapsing algorithm. I started developing it because I am 
experiencing performance problems with the field collapsing patch on a big 
index (8 GB).
The algorithm does adjacent-pseudo-field collapsing. It collapses only within 
the first X documents. Instead of making the collapsed docs disappear, the 
algorithm sends them to a given position in the relevance results list.
The reason I only collapse within the first X documents is that if I have, for 
example, 600000 results and I am showing 10 results per page, I really don't 
need to collapse on page 30000, or even on page 3000. Doing this, I am seeing 
dramatically better performance. The problem is that I couldn't find a way to 
plug the algorithm in as a component and keep good performance. I had to hack 
a few classes in SolrIndexSearcher.java.
This patch is just experimental and for testing purposes. If someone finds it 
interesting, it would be good to find a better way to integrate it than the 
current one.
Advice is more than welcome.

        
Functionality:
In solrconfig.xml we specify the pseudo-collapsing parameters:
     <str name="plus.considerMoreDocs">true</str>
     <str name="plus.considerHowMany">3000</str>
     <str name="plus.considerField">name</str>
(at the moment there is no threshold or any of the other parameters that exist 
in the current field collapsing patch)

plus.considerMoreDocs enables pseudo-collapsing
plus.considerHowMany sets how many of the top results the algorithm is 
applied to
plus.considerField is the field to pseudo-collapse on

If the number of results is lower than plus.considerHowMany, the algorithm will 
be applied to all the results.
Let's say there is a query with 600000 results and we've set considerHowMany to 
3000 (and we already have the docs sorted by relevance). 
What adjacent-pseudo-collapse does is: if the 2nd doc has to be collapsed, it 
is sent to position 2999 of the relevance results array (the last slot of the 
3000-document window, counting from 0). If the 3rd doc has to be collapsed too, 
it goes to position 2998, and so on.
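
The reordering described above can be sketched in a few lines of Python. This 
is only an illustration of the idea, not the code in the patch (which lives in 
SolrIndexSearcher.java); the function name pseudo_collapse and its arguments 
are made up for this example, and "has to be collapsed" is assumed to mean 
"has the same field value as the document right before it".

```python
from typing import Callable, Hashable, List, Sequence, TypeVar

Doc = TypeVar("Doc")


def pseudo_collapse(
    docs: Sequence[Doc],
    key: Callable[[Doc], Hashable],
    consider_how_many: int,
) -> List[Doc]:
    """Reorder relevance-sorted docs (hypothetical helper, not the patch itself).

    Inside the first `consider_how_many` docs, a doc whose collapse field value
    equals that of the doc right before it is moved to the end of that window.
    Docs outside the window keep their original positions.
    """
    # If there are fewer results than consider_how_many, use all of them.
    window = max(0, min(consider_how_many, len(docs)))

    kept: List[Doc] = []
    collapsed: List[Doc] = []
    for i in range(window):
        doc = docs[i]
        # Adjacent collapsing: compare only with the previous document.
        if i > 0 and key(doc) == key(docs[i - 1]):
            collapsed.append(doc)
        else:
            kept.append(doc)

    # The first collapsed doc takes the last slot of the window, the second
    # one the slot before it, and so on, hence the reversal.
    return kept + collapsed[::-1] + list(docs[window:])


# (doc name, collapse field value), already sorted by relevance.
results = [
    ("doc1", 3), ("doc2", 3), ("doc3", 4), ("doc4", 7), ("doc5", 6),
    ("doc6", 6), ("doc7", 5), ("doc8", 1), ("doc9", 2),
]

for how_many in (5, 9):
    reordered = pseudo_collapse(results, key=lambda d: d[1], consider_how_many=how_many)
    print(how_many, [name for name, _ in reordered])
```

Only the first consider_how_many documents are ever inspected, so the cost of 
the reordering does not grow with the total number of results.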

The algorithm is not applied when a sortspec is set or plus.considerMoreDocs is 
set to false. It is not applied when using MoreLikeThisRequestHandler either.
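
As a rough sketch, the conditions above amount to a single check. The function 
name, the two boolean flags and the plain dict of string parameters are 
assumptions made for this example, not names from the patch; treating a 
missing plus.considerMoreDocs as "false" is also an assumption.

```python
from typing import Mapping


def should_pseudo_collapse(
    params: Mapping[str, str],
    has_sort_spec: bool,
    is_more_like_this: bool,
) -> bool:
    """Return True only when pseudo-collapsing should run (hypothetical helper)."""
    # Assumption: an absent plus.considerMoreDocs is treated as "false".
    enabled = params.get("plus.considerMoreDocs", "false").strip().lower() == "true"
    # Skipped when an explicit sort is set or for MoreLikeThis requests.
    return enabled and not has_sort_spec and not is_more_like_this
```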

Example with a query of 9 results:
Results sorted by relevance without the pseudo-collapse algorithm:

doc1 - collapse_field_value 3
doc2 - collapse_field_value 3
doc3 - collapse_field_value 4
doc4 - collapse_field_value 7
doc5 - collapse_field_value 6
doc6 - collapse_field_value 6
doc7 - collapse_field_value 5
doc8 - collapse_field_value 1
doc9 - collapse_field_value 2

Results pseudo-collapsed with plus.considerHowMany = 5:

doc1 - collapse_field_value 3
doc3 - collapse_field_value 4
doc4 - collapse_field_value 7
doc5 - collapse_field_value 6
doc2 - collapse_field_value 3*
doc6 - collapse_field_value 6
doc7 - collapse_field_value 5
doc8 - collapse_field_value 1
doc9 - collapse_field_value 2

Results pseudo-collapsed with plus.considerHowMany = 9:

doc1 - collapse_field_value 3
doc3 - collapse_field_value 4
doc4 - collapse_field_value 7
doc5 - collapse_field_value 6
doc7 - collapse_field_value 5
doc8 - collapse_field_value 1
doc9 - collapse_field_value 2
doc6 - collapse_field_value 6*
doc2 - collapse_field_value 3*

*pseudo-collapsed documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
