AW: Avoid Solr Pivot Faceting Out of Memory / Shorter result for pivot faceting requests with facet.pivot.ngroup=true and facet.pivot.showLastList=false

2013-07-26 Thread Sandro Zbinden
Hey Erick

Thank you very much for your help.

So I dove into the Solr code and read the
http://wiki.apache.org/solr/HowToContribute section. Really informative :-)

I created a Jira issue about my problem and attached a patch file with an
implementation of pivot faceting with ngroup and showLastList support.

Here is the link to the Jira issue:

https://issues.apache.org/jira/browse/SOLR-5079

Best Regards Sandro


-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, July 21, 2013 14:59
To: solr-user@lucene.apache.org
Subject: Re: Avoid Solr Pivot Faceting Out of Memory / Shorter result for pivot
faceting requests with facet.pivot.ngroup=true and
facet.pivot.showLastList=false

Sorry, life's been really hectic lately. I don't know the pivot code, so I can't
comment much on that. But when it comes to code changes, it's perfectly
reasonable to open a JIRA and attach the code as a patch. You might have to
nudge people a bit to get them to carry it forward...

The case will be strengthened if you can say that all the tests pass with your
patch. If the tests don't pass, that may point to issues with your patch; take a
quick look at the tests that fail and see whether they're related to your
changes.

Start here:
http://wiki.apache.org/solr/HowToContribute

Best
Erick

On Fri, Jul 19, 2013 at 9:25 AM, Sandro Zbinden zbin...@imagic.ch wrote:
Dear Members

Do you guys think I am better off in the solr developer group with this
question?

To summarize: I would like to add a facet.pivot.ngroup=true param to show the
group count of the facet list. Furthermore, I would like to avoid out-of-memory
exceptions by reducing the result of a facet.pivot query.

 Best Regards

 Sandro Zbinden


-----Original Message-----
From: Sandro Zbinden [mailto:zbin...@imagic.ch]
Sent: Wednesday, July 17, 2013 13:45
To: solr-user@lucene.apache.org
Subject: Avoid Solr Pivot Faceting Out of Memory / Shorter result for
pivot faceting requests with facet.pivot.ngroup=true and
facet.pivot.showLastList=false

Dear Usergroup


I am getting an out-of-memory exception in the following scenario.
I have 4 SQL tables (patient, visit, study, and image) that are denormalized
for the Solr index. The index looks like the following:


 
 |p_id | p_lastname | v_id | v_name  |...
 |-----|------------|------|---------|...
 | 1   | Miller     | 10   | Study 1 |...
 | 2   | Miller     | 11   | Study 2 |...
 | 2   | Miller     | 12   | Study 3 |...  <-- duplication because of denormalization
 | 3   | Smith      | 13   | Study 4 |...

Now I am executing a facet query:

q=*:*&facet=true&facet.pivot=p_lastname,p_id&facet.limit=-1

And I get the following result:

<lst>
  <str name="field">p_lastname</str>
  <str name="value">Miller</str>
  <int name="count">3</int>
  <arr name="pivot">
    <lst>
      <str name="field">p_id</str>
      <int name="value">1</int>
      <int name="count">1</int>
    </lst>
    <lst>
      <str name="field">p_id</str>
      <int name="value">2</int>
      <int name="count">2</int>
    </lst>
  </arr>
</lst>
<lst>
  <str name="field">p_lastname</str>
  <str name="value">Smith</str>
  <int name="count">1</int>
  <arr name="pivot">
    <lst>
      <str name="field">p_id</str>
      <int name="value">3</int>
      <int name="count">1</int>
    </lst>
  </arr>
</lst>
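(For illustration: the counts above can be reproduced with a small stand-alone
sketch. This is plain Java over the four sample rows, not Solr's actual pivot
code; the class name and row layout are made up for the example.)

```java
import java.util.*;

public class PivotSketch {
    // The four denormalized sample rows from the table: {p_id, p_lastname}
    static final String[][] ROWS = {
        {"1", "Miller"}, {"2", "Miller"}, {"2", "Miller"}, {"3", "Smith"}
    };

    // Mirrors facet.pivot=p_lastname,p_id: the outer key is the last name,
    // the inner map counts documents per p_id bucket.
    static Map<String, Map<String, Integer>> pivot() {
        Map<String, Map<String, Integer>> out = new LinkedHashMap<>();
        for (String[] row : ROWS) {
            out.computeIfAbsent(row[1], k -> new LinkedHashMap<>())
               .merge(row[0], 1, Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Miller: 3 documents over p_id 1 (count 1) and p_id 2 (count 2);
        // Smith: 1 document under p_id 3 -- matching the XML response above.
        pivot().forEach((name, sub) -> {
            int total = sub.values().stream().mapToInt(Integer::intValue).sum();
            System.out.println(name + " count=" + total + " pivot=" + sub);
        });
    }
}
```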


The goal is to show our clients a list of the group values and, in parentheses,
how many patients each group contains:
 - Miller (2)
 - Smith (1)

This is why we need to use the facet.pivot method with facet.limit=-1. It is,
as far as I know, the only way to get a grouping for 2 criteria.
And we need the pivot list to count how many patients are in a group.


Currently this works well on smaller indexes, but if we have around 1'000'000
patients and execute a query like the one above, we run into an out of memory.
I figured out that the problem is not the calculation of the pivot but the
presentation of the result.
Because we load all fields (we cannot use facet.offset, because we need to
order the results ascending and descending), the result can get really big.

To avoid this overload I made a change to the solr-core
PivotFacetHandler.java class.
In the doPivots method I added the following code:

   NamedList<Integer> nl = this.getTermCounts(subField);
   pivot.add("ngroups", nl.size());

This gives me the group size of the list.
Then I removed the recursive call

   pivot.add("pivot", doPivots(nl, subField, nextField, fnames, subset));

so that my result looks like the following:

<lst>
  <str name="field">p_lastname</str>
  <str name="value">Miller</str>
  <int name="count">3</int>
  <int name="ngroup">2</int>
</lst>
<lst>
  <str name="field">p_lastname</str>
  <str name="value">Smith</str>
  <int name="count">1</int>
  <int name="ngroup">1</int>
</lst>
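(The idea behind the patch can be sketched the same way. Again this is plain
Java with illustrative names, not Solr's NamedList API: per last name only the
number of distinct p_id values is kept, so the large per-patient sub-list never
has to be built into the response.)

```java
import java.util.*;

public class NgroupSketch {
    // The four denormalized sample rows from the table: {p_id, p_lastname}
    static final String[][] ROWS = {
        {"1", "Miller"}, {"2", "Miller"}, {"2", "Miller"}, {"3", "Smith"}
    };

    // lastname -> number of distinct p_id values. While counting, only a set
    // of ids per group is tracked; the per-patient buckets are never emitted.
    static Map<String, Integer> ngroups() {
        Map<String, Set<String>> distinct = new LinkedHashMap<>();
        for (String[] row : ROWS)
            distinct.computeIfAbsent(row[1], k -> new HashSet<>()).add(row[0]);
        Map<String, Integer> out = new LinkedHashMap<>();
        distinct.forEach((name, ids) -> out.put(name, ids.size()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngroups()); // {Miller=2, Smith=1}
    }
}
```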


My question now is whether there is already something planned like
facet.pivot.ngroup=true and facet.pivot.showLastList=false to improve the
performance of pivot faceting.

Is there a chance we could get this into the Solr code? I think it's a really
small change, but it could improve the product enormously.

Best Regards

Sandro Zbinden
