Re: Optimizing fq query performance
FYI:
https://issues.apache.org/jira/browse/SOLR-11437
https://issues.apache.org/jira/browse/SOLR-12488

On Thu, Apr 18, 2019 at 7:24 AM Shawn Heisey wrote:
> On 4/17/2019 11:49 PM, John Davis wrote:
> > I did a few tests with our instance solr-7.4.0 and field:* vs
> > field:[* TO *] doesn't seem materially different compared to
> > has_field:1. If no one knows why Lucene optimizes one but not another,
> > it's not clear whether it even optimizes one to be sure.
>
> Queries using a boolean field will be even faster than the all-inclusive
> range query ... but they require work at index time to function
> properly. If you can do it this way, that's definitely preferred. I
> was providing you with something that would work even without the
> separate boolean field.
>
> If the cardinality of the field you're searching is very low (only a few
> possible values for that field across the whole index) then a wildcard
> query can be fast. It is only when the cardinality is high that the
> wildcard query is slow. Still, it is better to use the range query for
> determining whether the field exists, unless you have a separate boolean
> field for that purpose, in which case the boolean query will be a little
> bit faster.
>
> Thanks,
> Shawn
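[Editor's note: Shawn's "work at index time" suggestion can be sketched as a small pre-indexing step. This is a hypothetical illustration only; the field names `field` and `has_field` are taken from the thread's examples, not from any real schema.]

```python
# Hedged sketch: populate a dedicated boolean field at index time so that
# existence checks become fq=has_field:true instead of the slower field:*
# wildcard or the field:[* TO *] range query. Field names are illustrative.

def prepare_doc(doc):
    """Return a copy of doc with a has_field flag added."""
    out = dict(doc)
    out["has_field"] = doc.get("field") is not None
    return out

docs = [{"id": "1", "field": 42}, {"id": "2"}]
prepared = [prepare_doc(d) for d in docs]
print([d["has_field"] for d in prepared])  # [True, False]
```

At query time, fq=has_field:true is then a single-term lookup, which is the fastest of the three options Shawn ranks above.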
Re: Optimizing fq query performance
I did a few tests with our instance solr-7.4.0, and field:* vs field:[* TO *]
doesn't seem materially different compared to has_field:1. If no one knows
why Lucene optimizes one but not the other, it's not clear whether it even
optimizes one to be sure.

On Wed, Apr 17, 2019 at 4:27 PM Shawn Heisey wrote:
> On 4/17/2019 1:21 PM, John Davis wrote:
> > If what you describe is the case for range query [* TO *], why would
> > lucene not optimize field:* similar way?
>
> I don't know. Low level lucene operation is a mystery to me.
>
> I have seen first-hand that the range query is MUCH faster than the
> wildcard query.
>
> Thanks,
> Shawn
Re: Optimizing fq query performance
If what you describe is the case for the range query [* TO *], why would
Lucene not optimize field:* the same way?

On Wed, Apr 17, 2019 at 10:36 AM Shawn Heisey wrote:
> On 4/17/2019 10:51 AM, John Davis wrote:
> > Can you clarify why field:[* TO *] is lot more efficient than field:*
>
> It's a range query. For every document, Lucene just has to answer two
> questions -- is the value more than any possible value and is the value
> less than any possible value. The answer will be yes if the field
> exists, and no if it doesn't. With one million documents, there are two
> million questions that Lucene has to answer. Which probably seems like
> a lot ... but keep reading. (Side note: It wouldn't surprise me if
> Lucene has an optimization specifically for the all-inclusive range such
> that it actually only asks one question, not two.)
>
> With a wildcard query, there are as many questions as there are values
> in the field. Every question is asked for every single document. So if
> you have a million documents and there are three hundred thousand
> different values contained in the field across the whole index, that's
> 300 billion questions.
>
> Thanks,
> Shawn
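[Editor's note: the arithmetic behind Shawn's comparison can be made explicit. The numbers below are the illustrative ones from his mail; real Lucene internals are more nuanced than this cost model.]

```python
# Back-of-the-envelope model of the "questions asked" comparison above.

num_docs = 1_000_000        # documents in the index (from the example)
distinct_values = 300_000   # distinct terms in the field (from the example)

# All-inclusive range query [* TO *]: two bound checks per document
# (possibly only one, if Lucene special-cases the all-inclusive range).
range_checks = 2 * num_docs

# Wildcard field:*: every distinct term is checked against every document.
wildcard_checks = distinct_values * num_docs

print(range_checks)     # 2000000
print(wildcard_checks)  # 300000000000 -- the "300 billion questions"
```

The ratio between the two grows linearly with the field's cardinality, which is why the wildcard is harmless on low-cardinality fields but pathological on high-cardinality ones.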
Re: Optimizing fq query performance
Can you clarify why field:[* TO *] is a lot more efficient than field:*?

On Sun, Apr 14, 2019 at 12:14 PM Shawn Heisey wrote:
> On 4/13/2019 12:58 PM, John Davis wrote:
> > We noticed a sizable performance degradation when we add certain fq
> > filters to the query even though the result set does not change between
> > the two queries. I would've expected solr to optimize internally by
> > picking the most constrained fq filter first, but maybe my
> > understanding is wrong.
>
> All filters cover the entire index, unless the query parser that you're
> using implements the PostFilter interface, the filter cost is set high
> enough, and caching is disabled. All three of those conditions must be
> met in order for a filter to only run on results instead of the entire
> index.
>
> http://yonik.com/advanced-filter-caching-in-solr/
> https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
>
> Most query parsers don't implement the PostFilter interface. The lucene
> and edismax parsers do not implement PostFilter. Unless you've
> specified the query parser in the fq parameter, it will use the lucene
> query parser, and it cannot be a PostFilter.
>
> > Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
>
> If the point of the "field1:*" query clause is "make sure field1 exists
> in the document" then you would be a lot better off with this query
> clause:
>
> field1:[* TO *]
>
> This is an all-inclusive range query. It works with all field types
> where I have tried it, and that includes TextField types. It will be a
> lot more efficient than the wildcard query.
>
> Here's what happens with "field1:*". If the cardinality of field1 is
> ten million different values, then the query that gets constructed for
> Lucene will literally contain ten million values. And every single one
> of them will need to be compared to every document. That's a LOT of
> comparisons. Wildcard queries are normally very slow.
>
> Thanks,
> Shawn
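[Editor's note: the filtering styles Shawn describes can be illustrated as request parameters. This is a sketch only, reusing the field names from the example above; no live Solr instance is assumed. The {!frange} parser does implement PostFilter, and exists()/if() are Solr function queries, but the exact post-filter expression here is an assumption about how one would wire it up.]

```python
# Three ways to express the example's filters (sketch only; field and
# value names come from the thread's example).

# 1. Combined fq: cached as one filterCache entry; neither clause is
#    reusable on its own, and both run over the whole index.
combined = {"q": "*:*", "fq": ["field1:[* TO *] AND field2:value"]}

# 2. Separate fq params: each filter is cached and reused independently.
separate = {"q": "*:*", "fq": ["field1:[* TO *]", "field2:value"]}

# 3. Post-filter: {!frange} implements PostFilter, so with cache=false
#    and cost >= 100 the existence check runs only over documents that
#    already matched field2:value, instead of the entire index.
postfiltered = {
    "q": "*:*",
    "fq": [
        "field2:value",
        "{!frange l=1 u=1 cache=false cost=150}if(exists(field1),1,0)",
    ],
}
print(len(separate["fq"]))  # 2
```

Style 2 is the cheap, general fix; style 3 only pays off when the existence check is expensive and the other filters are selective.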
Re: Optimizing fq query performance
Patches welcome, but how would that be done? There's no fixed schema at
the Lucene level. It's even possible that no two documents in the index
have any fields in common. Given the structure of an inverted index,
answering the question "for document X, does this field have any value?"
is rather "interesting". You might be able to do something with docValues
and function queries, but that's overkill. In some sense, fq=field:* does
this dynamically by putting the results in the filterCache, where it
requires no calculations the next time, so it seems like more effort than
it's worth.

Best,
Erick

> On Apr 13, 2019, at 11:24 PM, John Davis wrote:
>
> > field1:* is slow in general for indexed fields because all terms for
> > the field need to be iterated (e.g. does term1 match doc1, does term2
> > match doc1, etc)
>
> This feels like something could be optimized internally by tracking
> existence of the field in a doc instead of making users index yet
> another field to track existence?
>
> BTW does this same behavior apply for tlong fields too where the value
> might be more continuous vs discrete strings?
>
> On Sat, Apr 13, 2019 at 12:30 PM Yonik Seeley wrote:
>
> > More constrained but matching the same set of documents just
> > guarantees that there is more information to evaluate per document
> > matched.
> > For your specific case, you can optimize
> > fq = 'field1:* AND field2:value' to &fq=field1:*&fq=field2:value
> > This will at least cause field1:* to be cached and reused if it's a
> > common pattern.
> > field1:* is slow in general for indexed fields because all terms for
> > the field need to be iterated (e.g. does term1 match doc1, does term2
> > match doc1, etc)
> > One can optimize this by indexing a term in a different field to turn
> > it into a single term query (i.e. exists:field1)
> >
> > -Yonik
> >
> > On Sat, Apr 13, 2019 at 2:58 PM John Davis wrote:
> >
> > > Hi there,
> > >
> > > We noticed a sizable performance degradation when we add certain fq
> > > filters to the query even though the result set does not change
> > > between the two queries. I would've expected solr to optimize
> > > internally by picking the most constrained fq filter first, but
> > > maybe my understanding is wrong. Here's an example:
> > >
> > > query1: fq = 'field1:* AND field2:value'
> > > query2: fq = 'field2:value'
> > >
> > > If we assume that the result set is identical between the two
> > > queries and field1 is in general more frequent in the index, we
> > > noticed query1 takes 100x longer than query2. In case it matters
> > > field1 is of type tlongs while field2 is a string.
> > >
> > > Any tips for optimizing this?
> > >
> > > John
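[Editor's note: Yonik's "single term query" suggestion quoted above can be sketched as an index-time step. The `exists` field name follows his example; the other field names are illustrative.]

```python
# Sketch of the quoted suggestion: record the populated field names in a
# multivalued "exists" field at index time, so the existence check becomes
# the single-term query fq=exists:field1 instead of field1:*.

def add_exists(doc):
    """Return a copy of doc with an 'exists' field listing populated fields."""
    out = dict(doc)
    out["exists"] = sorted(k for k, v in doc.items()
                           if k != "id" and v is not None)
    return out

doc = add_exists({"id": "1", "field1": 99, "field2": "value"})
print(doc["exists"])  # ['field1', 'field2']
```

A single-term query only touches one postings list, regardless of how many distinct values field1 holds, which is exactly what makes it cheaper than term enumeration.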
Re: Optimizing fq query performance
Also note that field1:* does not necessarily match all documents. A
document without that field will not match. So it really can't be
optimized the way you might expect since, as Yonik says, all the terms
have to be enumerated....

Best,
Erick

> On Apr 13, 2019, at 12:30 PM, Yonik Seeley wrote:
>
> More constrained but matching the same set of documents just guarantees
> that there is more information to evaluate per document matched.
> For your specific case, you can optimize fq = 'field1:* AND field2:value'
> to &fq=field1:*&fq=field2:value
> This will at least cause field1:* to be cached and reused if it's a
> common pattern.
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)
> One can optimize this by indexing a term in a different field to turn it
> into a single term query (i.e. exists:field1)
>
> -Yonik
>
> On Sat, Apr 13, 2019 at 2:58 PM John Davis wrote:
>
> > Hi there,
> >
> > We noticed a sizable performance degradation when we add certain fq
> > filters to the query even though the result set does not change
> > between the two queries. I would've expected solr to optimize
> > internally by picking the most constrained fq filter first, but maybe
> > my understanding is wrong. Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
> >
> > If we assume that the result set is identical between the two queries
> > and field1 is in general more frequent in the index, we noticed query1
> > takes 100x longer than query2. In case it matters field1 is of type
> > tlongs while field2 is a string.
> >
> > Any tips for optimizing this?
> >
> > John
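[Editor's note: Erick's point about term enumeration can be shown with a toy inverted index. The data here is invented for illustration and does not model Lucene's actual data structures.]

```python
# Toy inverted index showing why field1:* must enumerate every term: the
# only way to answer "which documents have any value in this field?" is
# to union the postings of all the field's terms. A document absent from
# every postings list (i.e. one without the field) simply never matches.

postings = {        # term -> doc ids containing it (toy data)
    "10": {1, 3},
    "25": {2},
    "99": {3},
}

def wildcard_match(index):
    matched = set()
    for term_docs in index.values():  # one iteration per distinct term
        matched |= term_docs
    return matched

print(sorted(wildcard_match(postings)))  # [1, 2, 3]
```

With three terms this is trivial; with the ten million terms in Shawn's earlier example, the same loop body runs ten million times, which is the cost being discussed in this thread.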