Re: Solr seems to reserve facet.limit results
Markus Jelsma wrote:
> I tried the overrequest ratio/count and set them to 1.0/0. Oddly enough,
> with these settings high facet.limit and extremely high facet.limit are
> both up to twice as slow as with 1.5/10 settings.

Not sure if it is the right explanation for your "extremely high facet.limit" case, but here goes... The two phases in distributed simple String faceting in Solr are very different from each other:

The first phase allocates a counter structure, iterates the query hits and increments the counters, then extracts the top-X facet terms and returns them.

The second phase receives a list of facet terms to count. The terms are those that the shard did not deliver in phase 1. An example might help here: For phase 1, shard 1 returns [a:5 b:3 c:3], while shard 2 returns [d:2 e:2 c:1]. This is merged to [a:5 c:4 b:3]. Since shard 2 did not return counts for the terms a and b, these counts are requested from shard 2 in phase 2.

In the current implementation, the term counts in the second phase are calculated in the same way as enum faceting: basically one tiny search for each term, with the query facetfield:term. This does not scale well, so it does not take many terms before phase 2 gets _slower_ than phase 1 (you can see for yourself in the solr.log).

So we want to keep the number of phase-2 term counts down, even if it means that phase 1 gets a bit slower. This is where over-requesting comes into play: the more you over-request, the slower phase 1 gets, but the chance of the merger having to ask for extra term counts also gets lower, as they were probably already returned in phase 1.

I wrote a bit about the phenomenon in https://sbdevel.wordpress.com/2014/09/11/even-sparse-faceting-is-limited/

- Toke Eskildsen
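[Editor's note: the merge-and-refine behaviour Toke describes can be sketched as below. This is a minimal Python illustration of the idea using the shard data from his example, not Solr's actual Java implementation; the function names are made up for the sketch.]

```python
from collections import Counter

def merge_phase1(shard_results):
    """Sum the per-shard phase-1 top-X counts into a global Counter."""
    merged = Counter()
    for counts in shard_results:
        merged.update(counts)
    return merged

def phase2_requests(shard_results, top_terms):
    """For each shard, list the top terms it did not report in phase 1.

    These are the terms the merger must ask that shard to count in
    phase 2 (one tiny facetfield:term search per term)."""
    return [[t for t in top_terms if t not in counts]
            for counts in shard_results]

# The example from the mail: shard 1 returns [a:5 b:3 c:3],
# shard 2 returns [d:2 e:2 c:1].
shard1 = {"a": 5, "b": 3, "c": 3}
shard2 = {"d": 2, "e": 2, "c": 1}

merged = merge_phase1([shard1, shard2])
top3 = [t for t, _ in merged.most_common(3)]
print(top3)                                   # ['a', 'c', 'b']
print(phase2_requests([shard1, shard2], top3))  # [[], ['a', 'b']]
```

Shard 2 must be asked for counts of a and b in phase 2, exactly as in the example; the more a shard over-requests in phase 1, the shorter that refinement list tends to be.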
RE: Solr seems to reserve facet.limit results
Thanks Chris, Toke,

I tried the overrequest ratio/count and set them to 1.0/0. Oddly enough, with these settings high facet.limit and extremely high facet.limit are both up to twice as slow as with the 1.5/10 settings. Even successive calls don't seem to 'warm anything up'. Anyone with an explanation for this? It is counterintuitive, to me at least.

Thanks,
Markus

-----Original message-----
> From: Chris Hostetter <hossman_luc...@fucit.org>
> Sent: Tuesday 6th December 2016 1:47
> To: solr-user@lucene.apache.org
> Subject: RE: Solr seems to reserve facet.limit results
>
> I think what you're seeing might be a result of the overrequesting done
> in phase #1 of a distributed facet query.
>
> The purpose of overrequesting is to mitigate the possibility of a
> constraint which should be in the topN for the collection as a whole, but
> is just outside the topN on every shard -- so it never makes it to the
> second phase of the distributed calculation.
>
> The amount of overrequest is, by default, a multiplicative function of the
> user-specified facet.limit with a fudge factor (IIRC: 10+(1.5*facet.limit))
>
> If you're using an explicitly high facet.limit, you can try setting the
> overrequest ratio/count to 1.0/0 respectively to force Solr to only
> request the # of constraints you've specified from each shard, and then
> aggregate them...
>
> https://lucene.apache.org/solr/6_3_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_RATIO
> https://lucene.apache.org/solr/6_3_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_COUNT
>
> One side note related to the workaround you suggested...
>
> : One simple solution, in my case would be, now just thinking of it, run
> : the query with no facets and no rows, get the numFound, and set that as
> : facet.limit for the actual query.
>
> ...that assumes that the number of facet constraints returned is limited
> by the total number of documents matching the query -- in general there is
> no such guarantee because of multivalued fields (or faceting on tokenized
> fields), so this type of approach isn't a good idea as a generalized
> solution.
>
> -Hoss
> http://www.lucidworks.com/
Re: Solr seems to reserve facet.limit results
On Mon, 2016-12-05 at 17:47 -0700, Chris Hostetter wrote:
> : One simple solution, in my case would be, now just thinking of it,
> : run the query with no facets and no rows, get the numFound, and set
> : that as facet.limit for the actual query.
>
> ...that assumes that the number of facet constraints returned is
> limited by the total number of documents matching the query -- in
> general there is no such guarantee because of multivalued fields (or
> faceting on tokenized fields), so this type of approach isn't a good
> idea as a generalized solution

For simple String/Text faceting, which Markus seems to be using, the number of repetitions of a term in a document does not matter: each term counts at most once per document. If there are any common-case deviations from this, the preface to the faceting documentation should be updated: "...along with numerical counts of how many matching documents were found for each term". https://cwiki.apache.org/confluence/display/solr/Faceting

- Toke Eskildsen, State and University Library, Denmark
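[Editor's note: the "at most once per document" rule Toke states can be illustrated as below. This is hypothetical Python for illustration, not Solr code; Solr's real implementation works on inverted-index structures.]

```python
from collections import Counter

def facet_counts(matching_docs):
    """Per-term count of how many documents contain the term at least once.

    matching_docs: list of term lists, one per matching document."""
    counts = Counter()
    for doc_terms in matching_docs:
        counts.update(set(doc_terms))  # set() dedupes repeats within one doc
    return counts

docs = [["apple", "apple", "pear"],  # "apple" twice in one doc: counts once
        ["apple"],
        ["pear", "plum"]]
print(facet_counts(docs))  # apple: 2, pear: 2, plum: 1
```

Note this does not contradict Hoss's point: the *count* per term is capped by numFound, but the *number of distinct terms* returned is not (here 3 documents yield 3 facet terms, and with more multivalued terms per document it could easily exceed numFound).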
RE: Solr seems to reserve facet.limit results
I think what you're seeing might be a result of the overrequesting done in phase #1 of a distributed facet query.

The purpose of overrequesting is to mitigate the possibility of a constraint which should be in the topN for the collection as a whole, but is just outside the topN on every shard -- so it never makes it to the second phase of the distributed calculation.

The amount of overrequest is, by default, a multiplicative function of the user-specified facet.limit with a fudge factor (IIRC: 10+(1.5*facet.limit))

If you're using an explicitly high facet.limit, you can try setting the overrequest ratio/count to 1.0/0 respectively to force Solr to only request the # of constraints you've specified from each shard, and then aggregate them...

https://lucene.apache.org/solr/6_3_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_RATIO
https://lucene.apache.org/solr/6_3_0/solr-solrj/org/apache/solr/common/params/FacetParams.html#FACET_OVERREQUEST_COUNT

One side note related to the workaround you suggested...

: One simple solution, in my case would be, now just thinking of it, run
: the query with no facets and no rows, get the numFound, and set that as
: facet.limit for the actual query.

...that assumes that the number of facet constraints returned is limited by the total number of documents matching the query -- in general there is no such guarantee because of multivalued fields (or faceting on tokenized fields), so this type of approach isn't a good idea as a generalized solution.

-Hoss
http://www.lucidworks.com/
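[Editor's note: taking Hoss's IIRC formula at face value, the per-shard phase-1 request size works out as below. A Python sketch of the arithmetic only; the ratio/count parameters mirror facet.overrequest.ratio and facet.overrequest.count, but the function itself is made up for illustration.]

```python
def shard_facet_limit(facet_limit, ratio=1.5, count=10):
    """Number of facet constraints each shard is asked for in phase 1,
    per the 10+(1.5*facet.limit) default Hoss recalls."""
    return int(ratio * facet_limit) + count

print(shard_facet_limit(100))                       # 160 (default overrequest)
print(shard_facet_limit(100, ratio=1.0, count=0))   # 100 (overrequest disabled)
print(shard_facet_limit(20_000_000))                # 30000010
```

The last line shows why a very high facet.limit hurts even when few terms exist: with default settings each shard is asked for half again as many constraints as the already huge limit.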
Re: Solr seems to reserve facet.limit results
On Fri, 2016-12-02 at 12:17 +0000, Markus Jelsma wrote:
> I have not considered streaming as I am still completely unfamiliar
> with it and I don't yet know what problems it can solve.

Standard faceting requires all nodes to produce their version of the full result and send it as one chunk, which is then merged at the calling node (+ other stuff). For large results that comes with a significant memory overhead. Solr streaming is ... well, streaming: with practically the same memory overhead whether you request 10K or 10 billion entries.

> One simple solution, in my case would be, now just thinking of it,
> run the query with no facets and no rows, get the numFound, and set
> that as facet.limit for the actual query.

That would work in your case. Still, try issuing a "*:*"-search and see if it breaks your very large facet request.

> Are there any examples / articles about consuming streaming facets
> with SolrJ?

Sorry, I have little experience with SolrJ.

- Toke Eskildsen, State and University Library, Denmark
RE: Solr seems to reserve facet.limit results
Hello Toke - this is on 6.3 (forgot to mention) and rows=0, and we consume the response in SolrJ.

I have not considered streaming as I am still completely unfamiliar with it and I don't yet know what problems it can solve. One simple solution in my case would be, now just thinking of it: run the query with no facets and no rows, get the numFound, and set that as facet.limit for the actual query.

Are there any examples / articles about consuming streaming facets with SolrJ?

Thanks,
Markus

-----Original message-----
> From: Toke Eskildsen <t...@statsbiblioteket.dk>
> Sent: Friday 2nd December 2016 13:01
> To: solr_user lucene_apache <solr-user@lucene.apache.org>
> Subject: Re: Solr seems to reserve facet.limit results
>
> On Fri, 2016-12-02 at 11:21 +0000, Markus Jelsma wrote:
> > Regardless of the number of actual results, queries with a very high
> > facet.limit are three to five times slower compared to much lower
> > values. For example, I have a query that returns roughly 19,000 facet
> > results. Queries with facet.limit=2 return within 200 ms but
> > queries with facet.limit=20 million return after around 800 ms. This
> > is in a cloud environment.
>
> First of all, requesting the top 20M facet terms in a multi-node cloud
> is really not advisable, as the transfer+merge overhead is huge. Have
> you considered streaming?
>
> > I vaguely remember an issue where Solr reserves the requested limit,
>
> I looked at both simple String faceting and numeric faceting in Solr.
> While there are pre-allocations of the structures involved, they both
> have built-in limiting, so the large performance difference that you
> are seeing is a bit strange. This was with the Solr 5.4 code that I
> happened to have open. Which version are you using?
>
> Just a thought: For plain search, specifying rows=20M is quite
> different from rows=20K, as that code does not have the same limiting
> as faceting. Are you perchance setting rows together with facet.limit?
>
> - Toke Eskildsen, State and University Library, Denmark
Re: Solr seems to reserve facet.limit results
On Fri, 2016-12-02 at 11:21 +0000, Markus Jelsma wrote:
> Regardless of the number of actual results, queries with a very high
> facet.limit are three to five times slower compared to much lower
> values. For example, I have a query that returns roughly 19,000 facet
> results. Queries with facet.limit=2 return within 200 ms but
> queries with facet.limit=20 million return after around 800 ms. This
> is in a cloud environment.

First of all, requesting the top 20M facet terms in a multi-node cloud is really not advisable, as the transfer+merge overhead is huge. Have you considered streaming?

> I vaguely remember an issue where Solr reserves the requested limit,

I looked at both simple String faceting and numeric faceting in Solr. While there are pre-allocations of the structures involved, they both have built-in limiting, so the large performance difference that you are seeing is a bit strange. This was with the Solr 5.4 code that I happened to have open. Which version are you using?

Just a thought: For plain search, specifying rows=20M is quite different from rows=20K, as that code does not have the same limiting as faceting. Are you perchance setting rows together with facet.limit?

- Toke Eskildsen, State and University Library, Denmark
Solr seems to reserve facet.limit results
Hi - in some cases we want all facet values and counts for a given query; it can be 10k or even 10M, but also just one thousand. Regardless of the number of actual results, queries with a very high facet.limit are three to five times slower compared to much lower values. For example, I have a query that returns roughly 19,000 facet results. Queries with facet.limit=2 return within 200 ms but queries with facet.limit=20 million return after around 800 ms. This is in a cloud environment.

I vaguely remember an issue where Solr reserves the requested limit; is there an open issue about this?

Thanks,
Markus