Query performance degrades when TLOG replica
We have the following setup: Solr 7.7.2 with 1 TLOG leader and 1 TLOG replica in a single shard. We have about 34.5 million documents with an approximate index size of 600 GB. I have noticed degraded query performance whenever the replica is (guessing here) trying to sync or perform actual replication. To test this, I fire a very basic query using the SolrJ client and the query comes back right away, but whenever replication is checking how far behind it is by comparing the generation IDs, the same queries take longer. In production we do not make these simple queries, but rather complex queries with filter queries and sorting. Those queries take much longer than they did on our previous setup (standalone Solr 6.1.0). Any help here is appreciated.

2020-09-02 16:35:30 INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909 status=0 QTime=0
2020-09-02 16:35:30 INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909 status=0 QTime=0
2020-09-02 16:36:00 INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909 status=0 QTime=0
2020-09-02 16:36:00 INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909 status=0 QTime=0
2020-09-02 16:36:30 INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909 status=0 QTime=0
2020-09-02 16:36:30 INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909 status=0 QTime=0
*2020-09-02 16:37:01* INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909 status=0 QTime=*1011*
*2020-09-02 16:37:01* INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909 status=0 QTime=*758*
*2020-09-02 16:37:32* INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458957 status=0 QTime=*1077*
*2020-09-02 16:37:32* INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458957 status=0 QTime=*1081*
2020-09-02 16:38:02 INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458957 status=0 QTime=*668*
2020-09-02 16:38:03 INFO [db_shard1_replica_t3] webapp=/solr path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458957 status=0 QTime=*1001*

*2020-09-02 16:37:01* INFO Master's generation: 263116
*2020-09-02 16:37:01* INFO Master's version: 1599064577322
*2020-09-02 16:37:01* INFO Slave's generation: 263116
*2020-09-02 16:37:01* INFO Slave's version: 1599064577322
*2020-09-02 16:37:01* INFO Slave in sync with master.

2020-09-02 16:37:02 INFO Master's generation: 104189
2020-09-02 16:37:02 INFO Master's version: 1599064620532
2020-09-02 16:37:02 INFO Slave's generation: 104188
2020-09-02 16:37:02 INFO Slave's version: 1599064560341
2020-09-02 16:37:02 INFO Starting replication process
2020-09-02 16:37:02 INFO Number of files in latest index in master: 1010
2020-09-02 16:37:02 INFO Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/opt/solr-7.7.2/server/solr/test_shard1_replica_t3/data/index.20200902163702345 lockFactory=org.apache.lucene.store.NativeFSLockFactory@77247ee; maxCacheMB=48.0 maxMergeSizeMB=4.0)
2020-09-02 16:37:02 INFO Bytes downloaded: 837587, Bytes skipped downloading: 0
2020-09-02 16:37:02 INFO Total time taken for download (fullCopy=false,bytesDownloaded=837587) : 0 secs (null bytes/sec) to NRTCachingDirectory(MMapDirectory@/opt/solr-7.7.2/server/solr/test_shard1_replica_t3/data/index.20200902163702345 lockFactory=org.apache.lucene.store.NativeFSLockFactory@77247ee; maxCacheMB=48.0 maxMergeSizeMB=4.0)
2020-09-02 16:37:03 INFO New IndexWriter is ready to be used.
2020-09-02 16:37:03 INFO Master's generation: 124002
2020-09-02 16:37:03 INFO Master's version: 1599064617242
2020-09-02 16:37:03 INFO Slave's generation: 124000
2020-09-02 16:37:03 INFO Slave's version: 1599064492914
2020-09-02 16:37:03 INFO Starting replication process
2020-09-02 16:37:04 INFO [db_shard1_replica_t3] webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from= http://178.33.234.1:8983/solr/db_shard1_replica_t25/&wt=javabin&version=2}{add=[11
Re: Facet Query performance
On 7/8/2019 12:00 PM, Midas A wrote:
> Number of Docs :50+ docs
> Index Size: 300 GB
> RAM: 256 GB
> JVM: 32 GB

Half a million documents producing an index size of 300GB suggests *very* large documents. That typically produces an index with fields that have very high cardinality, due to text tokenization.

Is Solr the only thing running on this machine, or does it have other memory-hungry software running on it? The screenshot described at the following URL may provide more insight. It will be important to get the sort correct. If the columns have been customized to show information other than the examples, it may need to be adjusted:

https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue

Assuming that Solr is the only thing on the machine, then it means you have about 224 GB of memory available to cache your index data, which is at least 300GB. Normally I would think being able to cache two thirds of the index should be enough for good performance, but it's always possible that there is something about your setup that means you don't have enough memory. Are you sure that you need a 32GB heap? Half a million documents should NOT require anywhere near that much heap.

> Cardinality: cat=44 rol=1005 ind=504 cl=2000

These cardinality values are VERY low. If you are certain about those numbers, it is not likely that these fields are significant contributors to query time, either with or without docValues. How did you obtain those numbers?

Those are not the only fields referenced in your query. I also see these: hemp cEmp pEmp is_udis id is_resume upt_date country exp ctc contents currdesig predesig lng ttl kw_sql kw_it

> QTime: 2988 ms

Three seconds for a query with so many facets is something I would probably be pretty happy to get.

> Our 35% queries takes more than 10 sec.

I have no idea what this sentence means.

> Please suggest the ways to improve response time . Attached queries and
> schema.xml and solrconfig.xml
>
> 1. Is there any other ways to rewrite queries that improve our query
> performance .?

With the information available, the only suggestion I have currently is to replace "q=*" with "q=*:*" -- assuming that the intent is to match all documents with the main query. According to what you attached (which I am very surprised to see -- attachments usually don't make it to the list), your df parameter is "ttl" ... a field that is heavily tokenized. That means that the cardinality of the ttl field is probably VERY high, which would make the wildcard query VERY slow.

> 2. can we see the DocValues cache in plugin/stats->cache-> section on
> solr UI panel ?

The admin UI only shows Solr caches. If Lucene even has a docValues cache (and I do not know whether it does), it will not be available in Solr's statistics. I am unaware of any cache in Solr for docValues. The entire point of docValues is to avoid the need to generate and cache large amounts of data, so I suspect there is not going to be anything available in this regard.

Thanks,
Shawn
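A minimal sketch of the "replace q=* with q=*:*" suggestion, built with Python's standard library. The parameter names (df=ttl) come from the URL in this thread; everything else is illustrative:

```python
from urllib.parse import urlencode

# Wildcard main query: the parser expands "*" against every term of the
# default field ("ttl" here), so its cost grows with that field's cardinality.
slow = {"q": "*", "df": "ttl", "rows": 40}

# Match-all query: "*:*" is special syntax the query parser understands,
# executed without any term expansion.
fast = dict(slow, q="*:*")

print(urlencode(slow))
print(urlencode(fast))
```

The rest of the request stays identical; only the main query changes.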
Re: Facet Query performance
On 7/8/2019 3:08 AM, Midas A wrote:
> I have enabled docvalues on facet field but query is still taking time.
> How i can improve the Query time .
> docValues="true" multiValued="true" termVectors="true" />
> *Query: *

There's very little information here -- only a single field definition and the query URL. No information about how many documents, what sort of cardinality there is in the fields being used in the query, no information about memory and settings, etc. You haven't even told us how long the query takes.

Your main query is a single * wildcard. A wildcard query is typically quite slow. If you are aiming for all documents, change that to q=*:* instead -- this is special syntax that the query parser understands, and is normally executed very quickly.

When a field has DocValues defined, it will automatically be used for field-based sorting, field-based facets, and field-based grouping. DocValues should not be relied on for queries, because indexed data is far faster for that usage. Queries *can* be done with docValues, but it would be VERY slow. Solr will avoid that usage if it can. I'm reasonably certain that docValues will NOT be used for facet.query as long as the field is indexed.

You do have three field-based facets -- using the facet.field parameter. If docValues was present on cat for ALL of the indexing that has happened, then they will work for that field, but you have not told us whether rol and pref have them defined.

You have a lot of faceting in this query. That can cause things to be slow.

Thanks,
Shawn
Re: Facet Query performance
Hi How i can know whether DocValues are getting used or not ? Please help me here . On Mon, Jul 8, 2019 at 2:38 PM Midas A wrote: > Hi , > > I have enabled docvalues on facet field but query is still taking time. > > How i can improve the Query time . > docValues="true" multiValued="true" termVectors="true" /> > > *Query: * > http://X.X.X.X: > /solr/search/select?df=ttl&ps=0&hl=true&fl=id,upt&f.ind.mincount=1&hl.usePhraseHighlighter=true&f.pref.mincount=1&q.op=OR&fq=NOT+hemp:(%22xgidx29760%22+%22xmwxmonster%22+%22xmwxmonsterindia%22+%22xmwxcom%22+%22xswxmonster+com%22+%22xswxmonster%22+%22xswxmonsterindia+com%22+%22xswxmonsterindia%22)&fq=NOT+cEmp:(% > 22nomster.com%22+OR+%22utyu%22)&fq=NOT+pEmp:(%22nomster.com > %22+OR+%22utyu%22)&fq=ind:(5)&fq=NOT+is_udis:2&fq=NOT+id:(92197+OR+240613+OR+249717+OR+1007148+OR+2500513+OR+2534675+OR+2813498+OR+9401682)&lowercaseOperators=true&ps2=0&bq=is_resume:0^-1000&bq=upt_date:[*+TO+NOW/DAY-36MONTHS]^2&bq=upt_date:[NOW/DAY-36MONTHS+TO+NOW/DAY-24MONTHS]^3&bq=upt_date:[NOW/DAY-24MONTHS+TO+NOW/DAY-12MONTHS]^4&bq=upt_date:[NOW/DAY-12MONTHS+TO+NOW/DAY-9MONTHS]^5&bq=upt_date:[NOW/DAY-9MONTHS+TO+NOW/DAY-6MONTHS]^10&bq=upt_date:[NOW/DAY-6MONTHS+TO+NOW/DAY-3MONTHS]^15&bq=upt_date:[NOW/DAY-3MONTHS+TO+*]^20&bq=NOT+country:isoin^-10&facet.query=exp:[+10+TO+11+]&facet.query=exp:[+11+TO+13+]&facet.query=exp:[+13+TO+15+]&facet.query=exp:[+15+TO+17+]&facet.query=exp:[+17+TO+20+]&facet.query=exp:[+20+TO+25+]&facet.query=exp:[+25+TO+109+]&facet.query=ctc:[+100+TO+101+]&facet.query=ctc:[+101+TO+101.5+]&facet.query=ctc:[+101.5+TO+102+]&facet.query=ctc:[+102+TO+103+]&facet.query=ctc:[+103+TO+104+]&facet.query=ctc:[+104+TO+105+]&facet.query=ctc:[+105+TO+107.5+]&facet.query=ctc:[+107.5+TO+110+]&facet.query=ctc:[+110+TO+115+]&facet.query=ctc:[+115+TO+10100+]&ps3=0&qf=contents^0.05+currdesig^1.5+predesig^1.5+lng^2+ttl+kw_skl+kw_it&f.cl.mincount=1&sow=false&hl.fl=ttl,kw_skl,kw_it,contents&wt=json&f.cat.mincount=1&qs=0&facet.field=ind&facet.field=cat&facet.
field=rol&facet.field=cl&facet.field=pref&debug=timing&qt=/resumesearch&f.rol.mincount=1&start=0&rows=40&version=2&q=*&facet.limit=10&pf=id&hl.q=&facet.mincount=1&pf3=id&pf2=id&facet=true&debugQuery=false > >
Facet Query performance
Hi , I have enabled docvalues on facet field but query is still taking time. How i can improve the Query time . *Query: * http://X.X.X.X: /solr/search/select?df=ttl&ps=0&hl=true&fl=id,upt&f.ind.mincount=1&hl.usePhraseHighlighter=true&f.pref.mincount=1&q.op=OR&fq=NOT+hemp:(%22xgidx29760%22+%22xmwxmonster%22+%22xmwxmonsterindia%22+%22xmwxcom%22+%22xswxmonster+com%22+%22xswxmonster%22+%22xswxmonsterindia+com%22+%22xswxmonsterindia%22)&fq=NOT+cEmp:(% 22nomster.com%22+OR+%22utyu%22)&fq=NOT+pEmp:(%22nomster.com %22+OR+%22utyu%22)&fq=ind:(5)&fq=NOT+is_udis:2&fq=NOT+id:(92197+OR+240613+OR+249717+OR+1007148+OR+2500513+OR+2534675+OR+2813498+OR+9401682)&lowercaseOperators=true&ps2=0&bq=is_resume:0^-1000&bq=upt_date:[*+TO+NOW/DAY-36MONTHS]^2&bq=upt_date:[NOW/DAY-36MONTHS+TO+NOW/DAY-24MONTHS]^3&bq=upt_date:[NOW/DAY-24MONTHS+TO+NOW/DAY-12MONTHS]^4&bq=upt_date:[NOW/DAY-12MONTHS+TO+NOW/DAY-9MONTHS]^5&bq=upt_date:[NOW/DAY-9MONTHS+TO+NOW/DAY-6MONTHS]^10&bq=upt_date:[NOW/DAY-6MONTHS+TO+NOW/DAY-3MONTHS]^15&bq=upt_date:[NOW/DAY-3MONTHS+TO+*]^20&bq=NOT+country:isoin^-10&facet.query=exp:[+10+TO+11+]&facet.query=exp:[+11+TO+13+]&facet.query=exp:[+13+TO+15+]&facet.query=exp:[+15+TO+17+]&facet.query=exp:[+17+TO+20+]&facet.query=exp:[+20+TO+25+]&facet.query=exp:[+25+TO+109+]&facet.query=ctc:[+100+TO+101+]&facet.query=ctc:[+101+TO+101.5+]&facet.query=ctc:[+101.5+TO+102+]&facet.query=ctc:[+102+TO+103+]&facet.query=ctc:[+103+TO+104+]&facet.query=ctc:[+104+TO+105+]&facet.query=ctc:[+105+TO+107.5+]&facet.query=ctc:[+107.5+TO+110+]&facet.query=ctc:[+110+TO+115+]&facet.query=ctc:[+115+TO+10100+]&ps3=0&qf=contents^0.05+currdesig^1.5+predesig^1.5+lng^2+ttl+kw_skl+kw_it&f.cl.mincount=1&sow=false&hl.fl=ttl,kw_skl,kw_it,contents&wt=json&f.cat.mincount=1&qs=0&facet.field=ind&facet.field=cat&facet.field=rol&facet.field=cl&facet.field=pref&debug=timing&qt=/resumesearch&f.rol.mincount=1&start=0&rows=40&version=2&q=*&facet.limit=10&pf=id&hl.q=&facet.mincount=1&pf3=id&pf2=id&facet=true&debugQuery=false
Re: Optimizing fq query performance
FYI https://issues.apache.org/jira/browse/SOLR-11437 https://issues.apache.org/jira/browse/SOLR-12488 On Thu, Apr 18, 2019 at 7:24 AM Shawn Heisey wrote: > On 4/17/2019 11:49 PM, John Davis wrote: > > I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO > > *] doesn't seem materially different compared to has_field:1. If no one > > knows why Lucene optimizes one but not another, it's not clear whether it > > even optimizes one to be sure. > > Queries using a boolean field will be even faster than the all-inclusive > range query ... but they require work at index time to function > properly. If you can do it this way, that's definitely preferred. I > was providing you with something that would work even without the > separate boolean field. > > If the cardinality of the field you're searching is very low (only a few > possible values for that field across the whole index) then a wildcard > query can be fast. It is only when the cardinality is high that the > wildcard query is slow. Still, it is better to use the range query for > determining whether the field exists, unless you have a separate boolean > field for that purpose, in which case the boolean query will be a little > bit faster. > > Thanks, > Shawn >
Re: Optimizing fq query performance
On 4/17/2019 11:49 PM, John Davis wrote:
> I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO
> *] doesn't seem materially different compared to has_field:1. If no one
> knows why Lucene optimizes one but not another, it's not clear whether it
> even optimizes one to be sure.

Queries using a boolean field will be even faster than the all-inclusive range query ... but they require work at index time to function properly. If you can do it this way, that's definitely preferred. I was providing you with something that would work even without the separate boolean field.

If the cardinality of the field you're searching is very low (only a few possible values for that field across the whole index) then a wildcard query can be fast. It is only when the cardinality is high that the wildcard query is slow. Still, it is better to use the range query for determining whether the field exists, unless you have a separate boolean field for that purpose, in which case the boolean query will be a little bit faster.

Thanks,
Shawn
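The index-time work Shawn mentions can be sketched like this: before sending a document for indexing, add an explicit existence flag so queries can use a cheap single-term filter instead of a wildcard or range. Field names here (field1, has_field1) follow the thread's example but are illustrative:

```python
def with_existence_flag(doc, field="field1"):
    """Return a copy of the document with a boolean flag recording
    whether `field` is present -- queryable later as fq=has_field1:true."""
    doc = dict(doc)
    doc["has_" + field] = field in doc and doc[field] is not None
    return doc

docs = [{"id": "1", "field1": 42}, {"id": "2"}]
indexed = [with_existence_flag(d) for d in docs]
```

The trade-off is exactly the one described above: a tiny amount of extra work per document at index time buys a single-term boolean filter at query time.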
Re: Optimizing fq query performance
I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO *] doesn't seem materially different compared to has_field:1. If no one knows why Lucene optimizes one but not another, it's not clear whether it even optimizes one to be sure. On Wed, Apr 17, 2019 at 4:27 PM Shawn Heisey wrote: > On 4/17/2019 1:21 PM, John Davis wrote: > > If what you describe is the case for range query [* TO *], why would > lucene > > not optimize field:* similar way? > > I don't know. Low level lucene operation is a mystery to me. > > I have seen first-hand that the range query is MUCH faster than the > wildcard query. > > Thanks, > Shawn >
Re: Optimizing fq query performance
On 4/17/2019 1:21 PM, John Davis wrote:
> If what you describe is the case for range query [* TO *], why would lucene
> not optimize field:* similar way?

I don't know. Low level lucene operation is a mystery to me.

I have seen first-hand that the range query is MUCH faster than the wildcard query.

Thanks,
Shawn
Re: Optimizing fq query performance
If what you describe is the case for range query [* TO *], why would lucene not optimize field:* similar way? On Wed, Apr 17, 2019 at 10:36 AM Shawn Heisey wrote: > On 4/17/2019 10:51 AM, John Davis wrote: > > Can you clarify why field:[* TO *] is lot more efficient than field:* > > It's a range query. For every document, Lucene just has to answer two > questions -- is the value more than any possible value and is the value > less than any possible value. The answer will be yes if the field > exists, and no if it doesn't. With one million documents, there are two > million questions that Lucene has to answer. Which probably seems like > a lot ... but keep reading. (Side note: It wouldn't surprise me if > Lucene has an optimization specifically for the all inclusive range such > that it actually only asks one question, not two) > > With a wildcard query, there are as many questions as there are values > in the field. Every question is asked for every single document. So if > you have a million documents and there are three hundred thousand > different values contained in the field across the whole index, that's > 300 billion questions. > > Thanks, > Shawn >
Re: Optimizing fq query performance
On 4/17/2019 10:51 AM, John Davis wrote:
> Can you clarify why field:[* TO *] is lot more efficient than field:*

It's a range query. For every document, Lucene just has to answer two questions -- is the value more than any possible value and is the value less than any possible value. The answer will be yes if the field exists, and no if it doesn't. With one million documents, there are two million questions that Lucene has to answer. Which probably seems like a lot ... but keep reading. (Side note: It wouldn't surprise me if Lucene has an optimization specifically for the all-inclusive range such that it actually only asks one question, not two)

With a wildcard query, there are as many questions as there are values in the field. Every question is asked for every single document. So if you have a million documents and there are three hundred thousand different values contained in the field across the whole index, that's 300 billion questions.

Thanks,
Shawn
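The arithmetic above, as a quick sanity check (using the same hypothetical numbers from the message):

```python
docs = 1_000_000
distinct_values = 300_000

# All-inclusive range query [* TO *]: at most two comparisons per document.
range_checks = 2 * docs                   # 2 million "questions"

# Wildcard field:* : every distinct term is checked against every document
# (the worst-case framing used in the explanation above).
wildcard_checks = docs * distinct_values  # 300 billion "questions"
```

That five-orders-of-magnitude gap is why the range query form is preferred for existence checks.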
Re: Optimizing fq query performance
Can you clarify why field:[* TO *] is lot more efficient than field:* On Sun, Apr 14, 2019 at 12:14 PM Shawn Heisey wrote: > On 4/13/2019 12:58 PM, John Davis wrote: > > We noticed a sizable performance degradation when we add certain fq > filters > > to the query even though the result set does not change between the two > > queries. I would've expected solr to optimize internally by picking the > > most constrained fq filter first, but maybe my understanding is wrong. > > All filters cover the entire index, unless the query parser that you're > using implements the PostFilter interface, the filter cost is set high > enough, and caching is disabled. All three of those conditions must be > met in order for a filter to only run on results instead of the entire > index. > > http://yonik.com/advanced-filter-caching-in-solr/ > https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/ > > Most query parsers don't implement the PostFilter interface. The lucene > and edismax parsers do not implement PostFilter. Unless you've > specified the query parser in the fq parameter, it will use the lucene > query parser, and it cannot be a PostFilter. > > > Here's an example: > > > > query1: fq = 'field1:* AND field2:value' > > query2: fq = 'field2:value' > > If the point of the "field1:*" query clause is "make sure field1 exists > in the document" then you would be a lot better off with this query clause: > > field1:[* TO *] > > This is an all-inclusive range query. It works with all field types > where I have tried it, and that includes TextField types. It will be a > lot more efficient than the wildcard query. > > Here's what happens with "field1:*". If the cardinality of field1 is > ten million different values, then the query that gets constructed for > Lucene will literally contain ten million values. And every single one > of them will need to be compared to every document. That's a LOT of > comparisons. Wildcard queries are normally very slow. 
> > Thanks, > Shawn >
Re: Optimizing fq query performance
On 4/13/2019 12:58 PM, John Davis wrote:
> We noticed a sizable performance degradation when we add certain fq filters
> to the query even though the result set does not change between the two
> queries. I would've expected solr to optimize internally by picking the
> most constrained fq filter first, but maybe my understanding is wrong.

All filters cover the entire index, unless the query parser that you're using implements the PostFilter interface, the filter cost is set high enough, and caching is disabled. All three of those conditions must be met in order for a filter to only run on results instead of the entire index.

http://yonik.com/advanced-filter-caching-in-solr/
https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/

Most query parsers don't implement the PostFilter interface. The lucene and edismax parsers do not implement PostFilter. Unless you've specified the query parser in the fq parameter, it will use the lucene query parser, and it cannot be a PostFilter.

> Here's an example:
>
> query1: fq = 'field1:* AND field2:value'
> query2: fq = 'field2:value'

If the point of the "field1:*" query clause is "make sure field1 exists in the document" then you would be a lot better off with this query clause:

field1:[* TO *]

This is an all-inclusive range query. It works with all field types where I have tried it, and that includes TextField types. It will be a lot more efficient than the wildcard query.

Here's what happens with "field1:*". If the cardinality of field1 is ten million different values, then the query that gets constructed for Lucene will literally contain ten million values. And every single one of them will need to be compared to every document. That's a LOT of comparisons. Wildcard queries are normally very slow.
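A sketch of the three post-filter conditions Shawn lists, expressed as local params on an fq. The frange parser is one of the parsers that does implement PostFilter; the field name "popularity" and the bound are hypothetical:

```python
# PostFilter requires all three conditions together:
#   1. a parser that implements PostFilter (frange does; lucene/edismax do not)
#   2. caching disabled (cache=false)
#   3. a high cost value to push the filter after the main query
fq = "{!frange l=10 cache=false cost=200}popularity"
params = {"q": "*:*", "fq": fq}
```

With any of the three conditions missing, the filter runs over the entire index as usual.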
Re: Optimizing fq query performance
Patches welcome, but how would that be done? There’s no fixed schema at the Lucene level. It’s even possible that no two documents in the index have any fields in common. Given the structure of an inverted index, answering the question “for document X does it have any value?" is rather “interesting”. You might be able to do something with docValues and function queries, but that’s overkill. In some sense, fq=field:* does this dynamically by putting the results in the filterCache where it requires no calculations the next time, so it seems like more effort than it’s worth.

Best,
Erick

> On Apr 13, 2019, at 11:24 PM, John Davis wrote:
>
>> field1:* is slow in general for indexed fields because all terms for the
>> field need to be iterated (e.g. does term1 match doc1, does term2 match
>> doc1, etc)
>
> This feels like something could be optimized internally by tracking
> existence of the field in a doc instead of making users index yet another
> field to track existence?
>
> BTW does this same behavior apply for tlong fields too where the value
> might be more continuous vs discrete strings?
>
> On Sat, Apr 13, 2019 at 12:30 PM Yonik Seeley wrote:
>
>> More constrained but matching the same set of documents just guarantees
>> that there is more information to evaluate per document matched.
>> For your specific case, you can optimize fq = 'field1:* AND field2:value'
>> to &fq=field1:*&fq=field2:value
>> This will at least cause field1:* to be cached and reused if it's a common
>> pattern.
>> field1:* is slow in general for indexed fields because all terms for the
>> field need to be iterated (e.g. does term1 match doc1, does term2 match
>> doc1, etc)
>> One can optimize this by indexing a term in a different field to turn it
>> into a single term query (i.e. exists:field1)
>>
>> -Yonik
>>
>> On Sat, Apr 13, 2019 at 2:58 PM John Davis wrote:
>>
>>> Hi there,
>>>
>>> We noticed a sizable performance degradation when we add certain fq
>> filters
>>> to the query even though the result set does not change between the two
>>> queries. I would've expected solr to optimize internally by picking the
>>> most constrained fq filter first, but maybe my understanding is wrong.
>>> Here's an example:
>>>
>>> query1: fq = 'field1:* AND field2:value'
>>> query2: fq = 'field2:value'
>>>
>>> If we assume that the result set is identical between the two queries and
>>> field1 is in general more frequent in the index, we noticed query1 takes
>>> 100x longer than query2. In case it matters field1 is of type tlongs
>> while
>>> field2 is a string.
>>>
>>> Any tips for optimizing this?
>>>
>>> John
>>
Re: Optimizing fq query performance
> field1:* is slow in general for indexed fields because all terms for the > field need to be iterated (e.g. does term1 match doc1, does term2 match > doc1, etc) This feels like something could be optimized internally by tracking existence of the field in a doc instead of making users index yet another field to track existence? BTW does this same behavior apply for tlong fields too where the value might be more continuous vs discrete strings? On Sat, Apr 13, 2019 at 12:30 PM Yonik Seeley wrote: > More constrained but matching the same set of documents just guarantees > that there is more information to evaluate per document matched. > For your specific case, you can optimize fq = 'field1:* AND field2:value' > to &fq=field1:*&fq=field2:value > This will at least cause field1:* to be cached and reused if it's a common > pattern. > field1:* is slow in general for indexed fields because all terms for the > field need to be iterated (e.g. does term1 match doc1, does term2 match > doc1, etc) > One can optimize this by indexing a term in a different field to turn it > into a single term query (i.e. exists:field1) > > -Yonik > > On Sat, Apr 13, 2019 at 2:58 PM John Davis > wrote: > > > Hi there, > > > > We noticed a sizable performance degradation when we add certain fq > filters > > to the query even though the result set does not change between the two > > queries. I would've expected solr to optimize internally by picking the > > most constrained fq filter first, but maybe my understanding is wrong. > > Here's an example: > > > > query1: fq = 'field1:* AND field2:value' > > query2: fq = 'field2:value' > > > > If we assume that the result set is identical between the two queries and > > field1 is in general more frequent in the index, we noticed query1 takes > > 100x longer than query2. In case it matters field1 is of type tlongs > while > > field2 is a string. > > > > Any tips for optimizing this? > > > > John > > >
Re: Optimizing fq query performance
Also note that field1:* does not necessarily match all documents. A document without that field will not match. So it really can’t be optimized they way you might expect since, as Yonik says, all the terms have to be enumerated…. Best, Erick > On Apr 13, 2019, at 12:30 PM, Yonik Seeley wrote: > > More constrained but matching the same set of documents just guarantees > that there is more information to evaluate per document matched. > For your specific case, you can optimize fq = 'field1:* AND field2:value' > to &fq=field1:*&fq=field2:value > This will at least cause field1:* to be cached and reused if it's a common > pattern. > field1:* is slow in general for indexed fields because all terms for the > field need to be iterated (e.g. does term1 match doc1, does term2 match > doc1, etc) > One can optimize this by indexing a term in a different field to turn it > into a single term query (i.e. exists:field1) > > -Yonik > > On Sat, Apr 13, 2019 at 2:58 PM John Davis > wrote: > >> Hi there, >> >> We noticed a sizable performance degradation when we add certain fq filters >> to the query even though the result set does not change between the two >> queries. I would've expected solr to optimize internally by picking the >> most constrained fq filter first, but maybe my understanding is wrong. >> Here's an example: >> >> query1: fq = 'field1:* AND field2:value' >> query2: fq = 'field2:value' >> >> If we assume that the result set is identical between the two queries and >> field1 is in general more frequent in the index, we noticed query1 takes >> 100x longer than query2. In case it matters field1 is of type tlongs while >> field2 is a string. >> >> Any tips for optimizing this? >> >> John >>
Re: Optimizing fq query performance
More constrained but matching the same set of documents just guarantees that there is more information to evaluate per document matched.

For your specific case, you can optimize fq = 'field1:* AND field2:value' to &fq=field1:*&fq=field2:value. This will at least cause field1:* to be cached and reused if it's a common pattern.

field1:* is slow in general for indexed fields because all terms for the field need to be iterated (e.g. does term1 match doc1, does term2 match doc1, etc). One can optimize this by indexing a term in a different field to turn it into a single term query (i.e. exists:field1)

-Yonik

On Sat, Apr 13, 2019 at 2:58 PM John Davis wrote:

> Hi there,
>
> We noticed a sizable performance degradation when we add certain fq filters
> to the query even though the result set does not change between the two
> queries. I would've expected solr to optimize internally by picking the
> most constrained fq filter first, but maybe my understanding is wrong.
> Here's an example:
>
> query1: fq = 'field1:* AND field2:value'
> query2: fq = 'field2:value'
>
> If we assume that the result set is identical between the two queries and
> field1 is in general more frequent in the index, we noticed query1 takes
> 100x longer than query2. In case it matters field1 is of type tlongs while
> field2 is a string.
>
> Any tips for optimizing this?
>
> John
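Yonik's split-the-filter suggestion can be sketched as request construction. Each separate fq parameter gets its own filterCache entry, so the expensive field1:* result is computed once and reused; the field names follow the thread's example:

```python
from urllib.parse import urlencode

# Single combined filter: cached as one entry, never reusable on its own.
combined = [("q", "*:*"), ("fq", "field1:* AND field2:value")]

# Split filters: each fq is cached independently in the filterCache,
# so field1:* is evaluated once and reused across queries that share it.
split = [("q", "*:*"), ("fq", "field1:*"), ("fq", "field2:value")]

query_string = urlencode(split)
```

This doesn't make the first evaluation of field1:* any cheaper; it only amortizes the cost across subsequent requests.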
Optimizing fq query performance
Hi there,

We noticed a sizable performance degradation when we add certain fq filters to the query even though the result set does not change between the two queries. I would've expected solr to optimize internally by picking the most constrained fq filter first, but maybe my understanding is wrong. Here's an example:

query1: fq = 'field1:* AND field2:value'
query2: fq = 'field2:value'

If we assume that the result set is identical between the two queries and field1 is in general more frequent in the index, we noticed query1 takes 100x longer than query2. In case it matters, field1 is of type tlongs while field2 is a string.

Any tips for optimizing this?

John
Benchmarking Solr Query performance
Hi all,

We would like to perform a benchmark of https://issues.apache.org/jira/browse/SOLR-11831

The patch improves the performance of grouped queries asking for only one result per group (aka group.limit=1). I remember seeing a page showing a benchmark of query performance on Wikipedia. Do you know if there is a way in Solr to reproduce the same benchmark, or some independent library to do that?

thanks,
Diego
Re: EXT: Re: Solr Query Performance benchmarking
Thanks everyone for taking time to respond to my email. I think you are correct in that the query results might be coming from main memory, as I only had around 7k queries. However, it is still not clear to me, given that everything was being served from main memory, why I am not able to push the CPU usage further up by putting more load on the cluster.

Thanks
Suresh

On 4/28/17, 6:44 PM, "Shawn Heisey" wrote:

>On 4/28/2017 12:43 PM, Toke Eskildsen wrote:
>> Shawn Heisey wrote:
>>> Adding more shards as Toke suggested *might* help,[...]
>> I seem to have phrased my suggestion poorly. What I meant to suggest
>> was a switch to a single shard (with 4 replicas) setup, instead of the
>> current 2 shards (with 2 replicas).
>
>Reading it a second time, it's me who made the error here. You did say
>1 shard and 4 replicas, I didn't read it correctly.
>
>Apologies!
>
>Thanks,
>Shawn
Re: Solr Query Performance benchmarking
On 4/28/2017 12:43 PM, Toke Eskildsen wrote: > Shawn Heisey wrote: >> Adding more shards as Toke suggested *might* help,[...] > I seem to have phrased my suggestion poorly. What I meant to suggest > was a switch to a single shard (with 4 replicas) setup, instead of the > current 2 shards (with 2 replicas). Reading it a second time, it's me who made the error here. You did say 1 shard and 4 replicas, I didn't read it correctly. Apologies! Thanks, Shawn
RE: Solr Query Performance benchmarking
Beautiful, thank you. -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Friday, April 28, 2017 3:07 PM To: solr-user@lucene.apache.org Subject: Re: Solr Query Performance benchmarking I use the JMeter plugins. They’ve been reorganized recently, so they aren’t where I originally downloaded them. [...]
Re: Solr Query Performance benchmarking
I use the JMeter plugins. They’ve been reorganized recently, so they aren’t where I originally downloaded them. Try this: https://jmeter-plugins.org/wiki/RespTimePercentiles/ https://jmeter-plugins.org/wiki/JMeterPluginsCMD/ Here is the command. It processes the previous JTL output file and puts the result in test.csv. java -Xmx2g -jar CMDRunner.jar --tool Reporter --generate-csv ${prev_dir}/${test} \ --input-jtl ${prev_dir}/${out} --plugin-type ResponseTimesPercentiles \ >> $logfile 2>&1 The script prints a summary of the run. I need to fix that to also print out the header for the columns. pct25=`grep "^25.0," ${test} | cut -d , -f 2-` median=`grep "^50.0," ${test} | cut -d , -f 2-` pct75=`grep "^75.0," ${test} | cut -d , -f 2-` pct90=`grep "^90.0," ${test} | cut -d , -f 2-` pct95=`grep "^95.0," ${test} | cut -d , -f 2-` echo `date` ": 25th percentiles are $pct25" echo `date` ": medians are $median" echo `date` ": 75th percentiles are $pct75" echo `date` ": 90th percentiles are $pct90" echo `date` ": 95th percentiles are $pct95" echo `date` ": full results are in ${test}" wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 28, 2017, at 12:00 PM, Davis, Daniel (NIH/NLM) [C] wrote: > > Walter, > > If you can share a pointer to that JMeter add-on, I'd love it. [...]
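The grep/cut pipeline in Walter's script can be sketched in Python for post-processing the same report shape (an assumption: each CSV row starts with a percentile level followed by one response time per request handler, as in CMDRunner's ResponseTimesPercentiles output; the sample data below is made up):

```python
# Sketch: pull selected percentile rows out of a ResponseTimesPercentiles
# CSV. The file layout (level, then per-handler times) is assumed, not
# verified against a real CMDRunner report.
import csv, io

def percentile_rows(text, levels=("25.0", "50.0", "75.0", "90.0", "95.0")):
    wanted = {}
    for row in csv.reader(io.StringIO(text)):
        if row and row[0] in levels:
            wanted[row[0]] = row[1:]   # response times, one per handler
    return wanted

# Two request handlers, synthetic numbers.
sample = "0.0,12,15\n25.0,20,22\n50.0,31,35\n75.0,40,44\n90.0,52,58\n95.0,61,70\n99.0,90,95\n"
rows = percentile_rows(sample)
print("medians are", rows["50.0"])   # -> medians are ['31', '35']
```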
RE: Solr Query Performance benchmarking
Walter, If you can share a pointer to that JMeter add-on, I'd love it. -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Friday, April 28, 2017 2:53 PM To: solr-user@lucene.apache.org Subject: Re: Solr Query Performance benchmarking I use production logs to get a mix of common and long-tail queries. It is very hard to get a realistic distribution with synthetic queries. [...]
Re: Solr Query Performance benchmarking
I use production logs to get a mix of common and long-tail queries. It is very hard to get a realistic distribution with synthetic queries. A benchmark run goes like this, with a big shell script driving it.
1. Reload the collection to clear caches.
2. Split the log into a cache warming set (usually the first 2000 queries) and the rest.
3. Run the warming set with four threads and no delay. This gets it done but usually does not overload the server.
4. Run the test set with hundreds of threads, each set for a particular rate. The overall config is usually between 2000 and 10,000 requests per minute.
5. Tests run for 1-2 hours.
6. Grep the results for non-200 responses, filter them out, and report.
7. Post process the results to make a CSV file of the percentile response times, one column for each request handler.
The benchmark driver is a headless JMeter, run with two different config files (warming and test). The post processing is a JMeter add-on. If the CPU gets over about 60% or the run queue gets to about the number of processors, the hosts are near congestion. The response time will spike if it is pushed harder than that. Prod logs are usually from a few hours of peak traffic during the daytime. This reduces the amount of bot traffic in the logs. I filter out load balancer health checks, Zabbix checks, and so on. I like to get a log of a million queries. That might require grabbing peak-traffic logs from several days. With the master/slave cluster, I use logs from a single slave. Those will have a lower cache hit rate because the requests are randomly spread out. For our Solr Cloud cluster, I’ve created a prod-size cluster in test. Expensive! There's a script in the JMeter config to make /handler and /select?qt=/handler get reported as the same thing. Thank you SolrJ. Our SLAs are for 95th percentile. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 28, 2017, at 11:39 AM, Erick Erickson wrote: > > Well, the best way to get no cache hits is to set the cache sizes to > zero ;). [...]
Re: Solr Query Performance benchmarking
Shawn Heisey wrote: > Adding more shards as Toke suggested *might* help,[...] I seem to have phrased my suggestion poorly. What I meant to suggest was a switch to a single shard (with 4 replicas) setup, instead of the current 2 shards (with 2 replicas). - Toke
Re: Solr Query Performance benchmarking
Well, the best way to get no cache hits is to set the cache sizes to zero ;). That provides worst-case scenarios and tells you exactly how much you're relying on caches. I'm not talking the lower-level Lucene caches here. One thing I've done is use the TermsComponent to generate a list of terms actually in my corpus, and save them away "somewhere" to substitute into my queries. The problem with that is when you have anything except very simple queries involving AND, you generate unrealistic queries when you substitute in random values; you can be asking for totally unrelated terms, and especially on short fields that leads to lots of 0-hit queries which are also unrealistic. So you get into a long cycle of generating a bunch of queries and removing all queries with less than N hits when you run them. Then generating more. Then... And each time you pick N, it introduces another layer of not-real-world possibility. Sometimes it's the best you can do, but if you can cull real-world applications it's _much_ better. Once you have a bunch (I like 10,000) you can be pretty confident. I not only like to run them randomly, but I also like to sub-divide them into N buckets and then run each bucket in order on the theory that that mimics what users actually did, they don't usually just do stuff at random. Any differences between the random and non-random runs can give interesting information. Best, Erick On Fri, Apr 28, 2017 at 9:38 AM, Rick Leir wrote: > (aside: Using Gatling or Jmeter?) > > Question: How can you easily randomize something in the query so you get no > cache hits? I think there are several levels of caching. > > -- > Sorry for being brief. Alternate email is rickleir at yahoo dot com
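Erick's generate-cull-bucket loop can be sketched like this (a sketch only: `hit_count` is a stand-in for a real Solr round trip, and the terms and template are made up):

```python
# Sketch: substitute real corpus terms into query templates, drop the
# unrealistic 0-hit queries, then split the survivors into buckets so
# they can be replayed in order as well as at random.
import random

def make_queries(terms, template="title:{} AND body:{}", n=20, seed=7):
    rng = random.Random(seed)
    return [template.format(rng.choice(terms), rng.choice(terms)) for _ in range(n)]

def cull(queries, hit_count, min_hits=1):
    # Remove queries under the hit threshold Erick describes.
    return [q for q in queries if hit_count(q) >= min_hits]

def buckets(queries, n=4):
    size = (len(queries) + n - 1) // n
    return [queries[i:i + size] for i in range(0, len(queries), size)]

terms = ["solr", "lucene", "shard", "replica", "cache"]
qs = make_queries(terms)
# Fake hit counter: pretend any query mentioning "cache" gets 0 hits.
kept = cull(qs, hit_count=lambda q: 0 if "cache" in q else 5)
print(len(qs), len(kept), len(buckets(kept)))
```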
Re: Solr Query Performance benchmarking
(aside: Using Gatling or Jmeter?) Question: How can you easily randomize something in the query so you get no cache hits? I think there are several levels of caching. -- Sorry for being brief. Alternate email is rickleir at yahoo dot com
Re: Solr Query Performance benchmarking
re: the q vs. fq question. My claim (not verified) is that the fastest of all would be q=*:*&fq={!cache=false}. That would bypass the scoring that putting it in the "q" clause would entail, as well as bypass the filter cache. But I have to agree with Walter, this is very suspicious IMO. Here's what I'd do: change my solrconfig to have cache sizes for both queryResultCache and filterCache that are significantly smaller than the number of queries I was cycling through for my stress test. If you really want to have a worst-case scenario, set the sizes to zero. If that _still_ gives you responses in the 30-40ms range you're in great shape. I suspect Walter and I would be on the same side of a bet that this won't be true. I once worked with a client who was thrilled that their QTimes were 3ms. They were firing the same query over and over, which reinforces Walter's point. Best, Erick On Fri, Apr 28, 2017 at 7:43 AM, Walter Underwood wrote: > More “unrealistic” than “amazing”. I bet the set of test queries is smaller > than the query result cache size. [...]
Re: Solr Query Performance benchmarking
More “unrealistic” than “amazing”. I bet the set of test queries is smaller than the query result cache size. Results from cache are about 2 ms, but network communication to the shards would add enough overhead to reach 40 ms. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 28, 2017, at 5:59 AM, Shawn Heisey wrote: > > On 4/27/2017 5:20 PM, Suresh Pendap wrote: >> Max throughput that I get: 12000 to 12500 reqs/sec >> 95 percentile query latency: 30 to 40 msec [...]
Re: Solr Query Performance benchmarking
On 4/27/2017 5:20 PM, Suresh Pendap wrote: > Max throughput that I get: 12000 to 12500 reqs/sec > 95 percentile query latency: 30 to 40 msec These numbers are *amazing* ... far better than I would have expected to see on a 27GB index, even in a situation where it fits entirely into available memory. I would only expect to see a few hundred requests per second, maybe as much as several hundred. Congratulations are definitely deserved. Adding more shards as Toke suggested *might* help, but it might also lower performance. More shards means that a single query from the user's perspective becomes more queries in the background. Unless you add servers to the cloud to handle the additional shards, more shards will usually slow things down on an index with a high query rate. On indexes with a very low query rate, more shards on the same hardware is likely to be faster, because there will be plenty of idle CPU capacity. What Toke said about filter queries is right on the money. Uncached filter queries are pretty expensive. Once a filter gets cached, it is SUPER fast ... but if you are constantly changing the filter query, then it is unlikely that new filters will be cached. When a particular query does not appear in either the queryResultCache or the filterCache, running it as a clause on the q parameter will usually be faster than running it as an fq parameter. If that exact query text will be used a LOT, then it makes sense to put it into a filter, where it will become very fast once it is cached. Thanks, Shawn
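Shawn's point about filter reuse can be illustrated with a toy LRU model (an illustration of the caching behavior only, not Solr's actual filterCache implementation): a filter reused across requests pays its cost once, while a constantly-changing filter misses every time.

```python
# Toy LRU "filterCache": repeated fq strings hit the cache; an
# ever-changing fq never does and pays the full cost each time.
from collections import OrderedDict

class FilterCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()
        self.hits = self.misses = 0

    def get_or_compute(self, fq, compute):
        if fq in self.data:
            self.hits += 1
            self.data.move_to_end(fq)      # mark as most recently used
            return self.data[fq]
        self.misses += 1
        self.data[fq] = compute()
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
        return self.data[fq]

cache = FilterCache(capacity=2)
for fq in ["status:active", "status:active", "date:[1 TO 2]", "status:active"]:
    cache.get_or_compute(fq, compute=lambda: object())
print(cache.hits, cache.misses)   # -> 2 2
```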
Re: Solr Query Performance benchmarking
On Thu, 2017-04-27 at 23:20 +, Suresh Pendap wrote: > Number of Solr Nodes: 4 > Number of shards: 2 > replication-factor: 2 > Index size: 55 GB > Shard/Core size: 27.7 GB > maxConnsPerHost: 1000 The overhead of sharding is not trivial. Your overall index size is fairly small, relative to your hardware. As your latency is (assumedly) fine around 30-40ms and you are chasing query throughput, you should try switching to 1 shard / 4 replicas. It should improve your throughput and will not hurt latency much (latency might also improve, but that is more uncertain). > The type of queries are mostly of the below pattern > q=*:*&fl=orderNo,purchaseOrderNos,timestamp,eventName,eventID,_src_&fq=((orderNo:+AND+purchaseOrderNos:))&sort=eventTimestamp+desc&rows=20&wt=javabin&version=2 That seems a bit strange. Why don't you use q instead of fq for the part of your request that changes? -- Toke Eskildsen, Royal Danish Library
Solr Query Performance benchmarking
Hi, I am trying to perform Solr query performance benchmarking and trying to measure the maximum throughput and latency that I can get from a given Solr cluster. Following are my configurations:
Number of Solr Nodes: 4
Number of shards: 2
replication-factor: 2
Index size: 55 GB
Shard/Core size: 27.7 GB
maxConnsPerHost: 1000
The Solr nodes are VMs with 16 vCPU cores and 112GB RAM. The CPU is 1-1 and it is not overcommitted. I am generating query load using a Java client program which fires Solr queries read from a static file. The client Java program is using the Apache HttpClient library to invoke the queries. I have already configured the client to create 300 max connections. The type of queries are mostly of the below pattern: q=*:*&fl=orderNo,purchaseOrderNos,timestamp,eventName,eventID,_src_&fq=((orderNo:+AND+purchaseOrderNos:))&sort=eventTimestamp+desc&rows=20&wt=javabin&version=2
Max throughput that I get: 12000 to 12500 reqs/sec
95th percentile query latency: 30 to 40 msec
I am measuring the latency and throughput on the client side in my program. The max throughput that I am able to get (sum of each individual client's throughput) is 12000 reqs/sec. I am running with 4 clients, each with 50 threads. Even if I increase the number of clients, the throughput still seems to be the same. It seems like I am hitting the maximum capacity of the cluster or some other limit due to which I am not able to put more stress on the server. My CPU is hitting 60% to 70%. I have not been able to increase the CPU usage more than this even when increasing client threads or generating load with more client nodes. The memory used is around 16% on all the nodes except on one node where the memory used is 41%. There is hardly any IO happening as it is a read test. I am wondering what is limiting my throughput; is there some internal thread pool limit that I am hitting due to which I am not able to increase my CPU/memory usage? My JVM settings are provided below.
I am using G1GC and -DSTOP.KEY=solrrocks -DSTOP.PORT=7983 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=13001 -Dcom.sun.management.jmxremote.rmi.port=13001 -Dcom.sun.management.jmxremote.ssl=false -Djetty.home=/app/solr6/server -Djetty.port=8983 -Dlog4j.configuration=file: -Dsolr.autoSoftCommit.maxTime=5000 -Dsolr.autoSoftCommit.minTime=5000 -Dsolr.install.dir=/app/solr6 -Dsolr.log.dir=/app/solrdata6/logs -Dsolr.log.muteconsole -Dsolr.solr.home= -Duser.timezone=UTC -DzkClientTimeout=15000 -DzkHost= -XX:+AlwaysPreTouch -XX:+ResizeTLAB -XX:+UseG1GC -XX:+UseGCLogFileRotation -XX:+UseLargePages -XX:+UseTLAB -XX:-UseBiasedLocking -XX:GCLogFileSize=20M -XX:MaxGCPauseMillis=50 -XX:NumberOfGCLogFiles=9 -XX:OnOutOfMemoryError=/app/solr6/bin/oom_solr.sh -Xloggc: -Xms11g -Xmx11g -Xss256k -verbose:gc I have not customized the Solr cache values. The DocumentCache, QueryResultCache, and FieldValueCache are all using default values. I read in one of the Solr performance documents that it is better to leave more memory to the operating system and utilize the OS buffer cache. Is this the best query throughput that I can extract from this sized cluster and index size combination? Any ideas are highly appreciated. Thanks Suresh
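For reference, the client-side numbers described here (throughput as requests over wall time, 95th percentile over all latency samples) can be computed as below; the latencies are synthetic, and the nearest-rank percentile method is an assumption about how the client measures p95:

```python
# Sketch: client-side throughput and p95 latency, computed the way the
# question describes (sum of per-client throughput, percentile over all
# samples). Latency values in ms are made up for the example.
def p95(latencies_ms):
    s = sorted(latencies_ms)
    idx = max(0, int(round(0.95 * len(s))) - 1)   # nearest-rank method
    return s[idx]

def throughput(request_count, wall_seconds):
    return request_count / wall_seconds

lat = [10, 12, 15, 20, 22, 25, 28, 30, 35, 40, 45, 50, 60, 70, 80, 90, 95, 100, 120, 400]
print("p95 =", p95(lat), "ms")                    # -> p95 = 120 ms
print("throughput =", throughput(48000, 4.0), "req/s")
```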
RE: DataImportHandler | Query | performance
Thanks a lot Shawn. Regards, Prateek Jain -Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: 23 December 2016 01:36 PM To: solr-user@lucene.apache.org Subject: Re: DataImportHandler | Query | performance On 12/23/2016 5:15 AM, Prateek Jain J wrote: [...] Since you already have an external process, you should improve that, rather than trying to use DIH. Thanks, Shawn
Re: DataImportHandler | Query | performance
On 12/23/2016 5:15 AM, Prateek Jain J wrote:
> We need some advice/views on the way we push our documents into Solr (4.8.1). The requirements:
> 1. Documents can be from 5 to 100 KB in size.
> 2. 10-50 users actively querying Solr with different sorts of data.
> 3. Data will be available frequently to be pushed to Solr (streaming). It must be available within 15 seconds to be queried.
>
> Current scenario: We dump data to a JSON file and have a cron job (in Java, each time a new file is created) which reads this file periodically and sends it to Solr using SolrJ (via HTTP). This file is massive and could be of size ~GBs in some cases (soft and hard Solr commits are configured appropriately).
>
> Issues:
> 1. Multiple cores exist in this Solr instance and they too follow a similar pattern.
> 2. This causes Solr to hang and causes OOM in some cases due to too many FileDescriptors opened (sometimes due to other issues).
>
> We would like to know if using DataImportHandler gives us any advantage. I just gave a quick glance at the Solr wiki but it is not clear if it offers any advantages in terms of performance (in this scenario).

If you do find a way to do this with DIH, it might make your "too many open files" problems *worse*, not better. Currently these files you are talking about are being handled by a completely separate process, not Solr. If you move this inside Solr, then Solr will open *more* files.

Your SolrJ program should read the files and construct SolrInputDocument objects, then send them in batches to Solr. It should not send massive files directly. That might fix the OOM issues, or it might not -- if not, then your Solr machine needs a larger heap. To deal with the open files problem, you're going to have to fiddle with the operating system to allow it to open more files.

DIH has limitations that frequently make it necessary for users to write their own programs to do indexing. Since you already have an external process, you should improve that, rather than trying to use DIH.

Thanks,
Shawn
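The batching Shawn describes boils down to chunking the parsed documents and calling SolrClient.add once per chunk instead of sending one huge file. The chunking step can be sketched in plain Java (generic here so it runs standalone; in real code T would be SolrInputDocument and each batch would go to SolrClient.add):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {
    // Split a list of documents into fixed-size batches. Each batch would be
    // sent to Solr in one SolrClient.add(batch) call, keeping memory and the
    // number of open files bounded regardless of the input file's size.
    static <T> List<List<T>> toBatches(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) docs.add(i);
        List<List<Integer>> batches = toBatches(docs, 1000);
        System.out.println(batches.size() + " batches"); // 3 batches: 1000, 1000, 500
    }
}
```

A streaming parser (e.g. reading the JSON file record by record) would feed this incrementally rather than loading the whole file, which is the part that avoids the OOM.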
DataImportHandler | Query | performance
Hi All,

We need some advice/views on the way we push our documents into Solr (4.8.1). Here are the requirements:

1. Documents can be from 5 to 100 KB in size.
2. 10-50 users actively querying Solr with different sorts of data.
3. Data will be available frequently to be pushed to Solr (streaming). It must be available within 15 seconds to be queried.

Current scenario: We dump data to a JSON file and have a cron job (in Java, each time a new file is created) which reads this file periodically and sends it to Solr using SolrJ (via HTTP). This file is massive and could be of size ~GBs in some cases (soft and hard Solr commits are configured appropriately).

Issues:

1. Multiple cores exist in this Solr instance and they too follow a similar pattern.
2. This causes Solr to hang and causes OOM in some cases due to too many FileDescriptors opened (sometimes due to other issues).

We would like to know if using DataImportHandler gives us any advantage. I just gave a quick glance at the Solr wiki but it is not clear if it offers any advantages in terms of performance (in this scenario).

Regards,
Prateek Jain
Re: facet query performance
On Mon, 2016-11-14 at 11:36 +0530, Midas A wrote:
> How to improve facet query performance

1) Don't shard unless you really need to. Replicas are fine.
2) If the problem is the first facet call, then enable DocValues and re-index.
3) Keep facet.limit <= 100, especially if you shard.

And most important:

4) Describe in detail what you have, how you facet, and what you expect. Give us something to work with.

- Toke Eskildsen, State and University Library, Denmark
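Toke's point 2 (enable DocValues and re-index) corresponds to a schema change along these lines. The field name here is a placeholder, and a full re-index is required after the change:

```xml
<!-- schema.xml: docValues stores the column-oriented structure faceting needs
     at index time, avoiding the expensive FieldCache build on the first facet
     call after each new searcher. -->
<field name="category" type="string" indexed="true" stored="true" docValues="true"/>
```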
facet query performance
How to improve facet query performance
Re: Poor Solr Cloud Query Performance against a Small Dataset
Good tip Rick, I'll dig in and make sure everything is set up correctly. Thanks!

-D

Dave Seltzer
Chief Systems Architect
TVEyes
(203) 254-3600 x222

On Wed, Nov 2, 2016 at 9:05 PM, Rick Leir wrote:
> Here is a wild guess. Whenever I see a 5 second delay in networking, I
> think DNS timeouts. YMMV, good luck.
>
> cheers -- Rick
Re: Poor Solr Cloud Query Performance against a Small Dataset
Here is a wild guess. Whenever I see a 5 second delay in networking, I think DNS timeouts. YMMV, good luck.

cheers -- Rick

On 2016-11-01 04:18 PM, Dave Seltzer wrote:
> Hello! I'm trying to utilize SolrCloud to help with a hash search problem.
> The record set has only 4,300 documents. When I run my search against a
> single core I get results on the order of 10 ms. When I run the same search
> against SolrCloud, results take about 5,000 ms.
Poor Solr Cloud Query Performance against a Small Dataset
Hello!

I'm trying to utilize SolrCloud to help with a hash search problem. The record set has only 4,300 documents.

When I run my search against a single core I get results on the order of 10 ms. When I run the same search against SolrCloud, results take about 5,000 ms.

Is there something about this particular query which makes it perform poorly in a Cloud environment? The query looks like this (linebreaks added for readability):

{!frange+l%3D5+u%3D25}sum(
termfreq(hashTable_0,'225706351'),
termfreq(hashTable_1,'17664000'),
termfreq(hashTable_2,'86447642'),
termfreq(hashTable_3,'134816033'),
termfreq(hashTable_4,'1061820218'),
termfreq(hashTable_5,'543627850'),
termfreq(hashTable_6,'-1828379348'),
termfreq(hashTable_7,'423236759'),
termfreq(hashTable_8,'522192943'),
termfreq(hashTable_9,'572537937'),
termfreq(hashTable_10,'286991887'),
termfreq(hashTable_11,'789711386'),
termfreq(hashTable_12,'235801909'),
termfreq(hashTable_13,'67109911'),
termfreq(hashTable_14,'609628285'),
termfreq(hashTable_15,'1796472850'),
termfreq(hashTable_16,'202312085'),
termfreq(hashTable_17,'306200840'),
termfreq(hashTable_18,'85657669'),
termfreq(hashTable_19,'671548727'),
termfreq(hashTable_20,'71309060'),
termfreq(hashTable_21,'1125848323'),
termfreq(hashTable_22,'1077548043'),
termfreq(hashTable_23,'117638159'),
termfreq(hashTable_24,'-1408039642'))

The schema looks like this: subFingerprintId

I've included some sample output below. I wasn't sure if this was a matter of changing the routing key in the collections system, or if this is a more fundamental problem with the way term frequencies are counted in a SolrCloud environment.

Many thanks!
-Dave -- Single Core Example Query: { "responseHeader":{ "status":0, "QTime":13, "params":{ "q":"{!frange l=5 u=25}sum(termfreq(hashTable_0,'354749018'),termfreq(hashTable_1,'286534657'),termfreq(hashTable_2,'1798007322'),termfreq(hashTable_3,'151854851'),termfreq(hashTable_4,'142869766'),termfreq(hashTable_5,'240584768'),termfreq(hashTable_6,'68120837'),termfreq(hashTable_7,'134945863'),termfreq(hashTable_8,'688067644'),termfreq(hashTable_9,'621220625'),termfreq(hashTable_10,'1732446991'),termfreq(hashTable_11,'505547282'),termfreq(hashTable_12,'135990559'),termfreq(hashTable_13,'123097623'),termfreq(hashTable_14,'454174225'),termfreq(hashTable_15,'788988675'),termfreq(hashTable_16,'53480196'),termfreq(hashTable_17,'487550779'),termfreq(hashTable_18,'455477045'),termfreq(hashTable_19,'1141310997'),termfreq(hashTable_20,'71322652'),termfreq(hashTable_21,'805503533'),termfreq(hashTable_22,'656158000'),termfreq(hashTable_23,'302410303'),termfreq(hashTable_24,'194970957'))", "indent":"on", "wt":"json", "debugQuery":"on", "_":"1478024378680"}}, "response":{"numFound":1,"start":0,"docs":[ { "subFingerprintId":"f6c9093e-e8e9-4c0f-aa2a-387b46e7ef2a", "trackId":"5207095a-0126-4c41-8787-16d41165158a", "sequenceNumber":136, "sequenceAt":12.5399129172714, "hashTable_0":354749018, "hashTable_1":287779841, "hashTable_2":1797994010, "hashTable_3":151854851, "hashTable_4":375260422, "hashTable_5":441911360, "hashTable_6":68120837, "hashTable_7":420158535, "hashTable_8":16979004, "hashTable_9":1443304209, "hashTable_10":1732468239, "hashTable_11":455215642, "hashTable_12":135990559, "hashTable_13":123093271, "hashTable_14":1444029969, "hashTable_15":788988675, "hashTable_16":53480196, "hashTable_17":488255035, "hashTable_18":505809973, "hashTable_19":201814293, "hashTable_20":70208520, "hashTable_21":805503541, "hashTable_22":658713904, "hashTable_23":302387775, "hashTable_24":194970957, "_version_":1549818240561053696}] }, "debug":{ "rawquerystring":"{!frange l=5 
u=25}sum(termfreq(hashTable_0,'354749018'),termfreq(hashTable_1,'286534657'),termfreq(hashTable_2,'1798007322'),termfreq(hashTable_3,'151854851'),termfreq(hashTable_4,'142869766'),termfreq(hashTable_5,'240584768'),termfreq(hashTable_6,'68120837'),termfreq(hashTable_7,'134945863'),termfreq(hashTable_8,'688067644'),termfreq(hashTable_9,'621220625'),termfreq(hashTable_10,'1732446991'),termfreq(hashTable_11,'505547282'),termfreq(hashTable_12,'135990559'),termfreq(hashTable_13,'123097623'),termfreq(hashTable_14,'454174225'),termfreq(hashTable_15,'788988675'),termfreq(hashTable_16,'53480196'),termfreq(hashTable_17,'487550779'),termfreq(hashTable_18,'455477045'),termfreq(hashTable_19,'1141310997'),termfreq(hashTable_20,'71322652'),termfreq(hashTable_21,'805503533'),termfreq(hashTable_22,'656158000'),termfreq(hashTable_23,'302410303'),termfreq(hashTable_24,'194970957'))", "querystring":"{!frange l=5
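For context, the frange sum above appears to count how many of the 25 hash-table positions in a stored fingerprint match the query fingerprint (each termfreq on a single-valued int field yields 0 or 1), accepting docs with between 5 and 25 matches. A standalone sketch of that scoring logic, with hypothetical helper names (this is not Solr code):

```java
public class FingerprintMatcher {
    // Client-side equivalent of the {!frange l=5 u=25}sum(termfreq(...))
    // evaluation for a single document: count agreeing hash positions.
    static int matches(int[] queryHashes, int[] docHashes) {
        int count = 0;
        for (int i = 0; i < queryHashes.length; i++) {
            if (queryHashes[i] == docHashes[i]) count++;
        }
        return count;
    }

    // A doc is a hit when the match count falls inside [lower, upper].
    static boolean accept(int[] q, int[] d, int lower, int upper) {
        int m = matches(q, d);
        return m >= lower && m <= upper;
    }

    public static void main(String[] args) {
        int[] q = {354749018, 286534657, 1798007322, 151854851};
        int[] d = {354749018, 287779841, 1798007322, 151854851};
        System.out.println(matches(q, d) + " of 4 positions agree"); // 3 of 4 positions agree
    }
}
```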
Multi-core query performance tuning/monitoring
Hi,

I have a few filter queries that use a cross-core join to filter documents. After I inverted those joins they became slower. It looks something like this: I used to query the "product" core with a query that contains

fq={!join to=tags from=preferred_tags fromIndex=user}(country:US AND ...)&fq=product_category:0&...

Now I query the "user" core with a query that contains

fq={!join to=preferred_tags from=tags fromIndex=product}(product_category:0 AND ...)&fq=country:US&...

Both tags and preferred_tags might contain multiple values, and the "product" core is used more often (so it could be that the cache is warmer for that core). The "user" index is smaller than the "product" index.

After a few queries Solr seems to warm up and serves the query ~50x faster, but the initial queries are extremely slow. I tried turning off caching for the filter and making its cost higher than 150, but it did not help much. I was thinking about adding autowarming queries, but first I want to check what makes the join so slow. What would be the right way to debug it, to see which part of it is the slowest?

Also, if I do go with autowarming, since there are 2 cores involved I wonder which warmup query should be used: "fq={!join to=preferred_tags from=tags fromIndex=product}(product_category:0 AND ...)" on the "user" core, or "fq=(product_category:0 AND ...)" on "product".

Solr version is 4.3.0.

Regards,
Oleg
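For the autowarming question, one mechanism is a searcher-event listener in the relevant core's solrconfig.xml. This is only a sketch; which fq to warm, and on which core, is exactly the open question in the message, and the fq below is a shortened version of the one quoted above:

```xml
<!-- solrconfig.xml on the "user" core: run the join filter once per new
     searcher so the first real query does not pay the warm-up cost.
     firstSearcher fires on core start; a matching newSearcher listener
     would cover post-commit searchers as well. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="fq">{!join to=preferred_tags from=tags fromIndex=product}product_category:0</str>
    </lst>
  </arr>
</listener>
```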
Re: Effects of insert order on query performance
Thanks Emir. I’m unfortunately already using a routing key that needs to be at the top level, since I’m collapsing on that field. Adding a sub-key won’t help much if my theory is correct, as even a single shard (distrib=false) showed serious performance degradation, and query latency is the max(shard latency). I’d need a routing scheme that assured that a given shard has *only* A’s, or *only* B’s.

Even if I could use “permissions” as the top-level routing key though, this is a very low cardinality field, so I’d expect to end up with very large differences between the sizes of the shards in that case. That’s fine from a SolrCloud query perspective of course, but it makes for more difficult resource provisioning.

On 8/12/16, 1:39 AM, "Emir Arnautovic" wrote:
> Hi Jeff,
> I will not comment on your theory (will let that to guys more familiar
> with Lucene code) but will point to one alternative solution: routing.
Re: Effects of insert order on query performance
Hi Jeff,

I will not comment on your theory (I will leave that to people more familiar with the Lucene code) but will point you to one alternative solution: routing. You can use routing to put documents with different permissions on different shards, and use composite hash routing to split the "A" (and maybe "B" as well) documents across multiple shards. That makes sure all docs with the same permission are on the same shard, so at query time only those shards are queried (fewer shards to query) and there is no need to include the terms query or filter query at all. Here is a blog post explaining the benefits of composite hash routing: https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On 11.08.2016 19:39, Jeff Wartes wrote:
> This isn’t really a question, although some validation would be nice. It’s
> more of a warning. Tldr is that the insert order of documents in my
> collection appears to have had a huge effect on my query speed.
Effects of insert order on query performance
This isn’t really a question, although some validation would be nice. It’s more of a warning. Tldr is that the insert order of documents in my collection appears to have had a huge effect on my query speed.

I have a very large (sharded) SolrCloud 5.4 index. One aspect of this index is a multi-valued field (“permissions”) that for 90% of docs contains one particular value (“A”) and for 10% of docs contains another distinct value (“B”). It’s intended to represent something like permissions, so more values are possible in the future, but not present currently. In fact, the addition of docs with value B to this index was very recent; previously all docs had value “A”. All queries, in addition to various other Boolean-query type restrictions, have a terms query on this field, like {!terms f=permissions v=A} or {!terms f=permissions v=A,B}.

Last week, I tried to re-index the whole collection from scratch, using source data. Query performance on the resulting re-index proved to be abysmal: I could get barely 10% of my previous query throughput, and even that was at latencies that were orders of magnitude higher than what I had in production.

I hooked up some CPU profiling to a server that had shards from both the old and new version of the collection, and eventually it looked like the significant difference in processing the two collections was coming from ConstantWeight.scorer(). Specifically, this line
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/solr/core/src/java/org/apache/solr/search/SolrConstantScoreQuery.java#L102
was far more expensive in my re-indexed collection. From there, the call chain goes through an LRUQueryCache, down to a BulkScorer, and ends up with the extra work happening here:
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/lucene/core/src/java/org/apache/lucene/search/Weight.java#L169

I don’t pretend to understand all that code, but the difference in my re-index appears to have something to do either with that cache, or the aggregate docIdSets that need weights generated are simply much bigger in my re-index. But the queries didn’t change, and the data is basically the same, so what else could have changed?

The documents with the “B” distinct value were added recently to the high-performance collection, but the A’s and the B’s were all mixed up in the source data dump I used to re-index. On a hunch, I manually ordered the docs such that the A’s were all first and re-indexed again, and performance is great!

Here’s my theory: Using TieredMergePolicy, the vast majority of the documents in an index are contained in the largest segments. I’m guessing there’s an optimization somewhere that says something like “This segment only has A’s”. By indexing all the A’s first, those biggest segments only contain A’s, and only the smallest, newest segments are unable to make use of that optimization.

Here’s the scary part: Although my re-index is now performing well, if this theory is right, some random insert (or a deliberate optimize) at some random point in the future could cascade a segment merge such that the largest segment(s) now contain both A’s and B’s, and performance suddenly goes over a cliff. I have no way to prevent this possibility except to stop doing inserts.

My current thinking is that I need to pull the terms-query part out of the query and do a filter query for it instead. Probably as a post-filter, since I’ve had bad luck with very large filter queries and the filter cache. I’d tested this originally (when I only had A’s), but found the performance was a bit worse than just leaving it in the query. I’ll take a bit worse and predictability over a bit better and a time bomb though, if those are my choices.

If anyone has any comments refuting or supporting this theory, I’d certainly like to hear it. This is the first time I’ve encountered anything about insert order mattering from a performance perspective, and it becomes a general-form question around how to handle low-cardinality fields.
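Jeff's post-filter plan refers to an fq with cache=false and a cost of 100 or more, which Solr runs as a PostFilter for query types that support it: the expensive check is consulted only for documents that already matched everything else. The evaluation order can be illustrated with a plain-Java sketch (the predicates, doc model, and counter are hypothetical, not Solr internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class PostFilterSketch {
    static int expensiveCalls = 0;

    // The cheap main-query predicate runs first; the costly post-filter
    // predicate is consulted only for docs that survive it, mirroring how a
    // Solr PostFilter's DelegatingCollector sees only already-matching docs.
    static List<Integer> search(List<Integer> docs, IntPredicate mainQuery, IntPredicate postFilter) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : docs) {
            if (!mainQuery.test(doc)) continue;
            expensiveCalls++;
            if (postFilter.test(doc)) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 100; i++) docs.add(i);
        List<Integer> hits = search(docs, d -> d % 10 == 0, d -> d % 20 == 0);
        // 10 docs reach the post-filter; only 5 pass it
        System.out.println(hits.size() + " hits, " + expensiveCalls + " expensive checks");
    }
}
```

The appeal for this case is that the big terms check is paid per candidate document rather than being built (and cached) as a full bitset over the whole index.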
Re: SolrCloud - Query performance degrades with multiple servers(Shards)
15M docs may still comfortably fit in a single shard! I've seen up to 300M docs fit on a shard. Then again, I've seen 10M docs make things unacceptably slow. You simply cannot extrapolate from 10K to 5M reliably. Put all 5M docs on the stand-alone servers and test _that_.

Whenever I see numbers like 30K qps (assuming this is queries, not number of docs indexed) I wonder if you're using the same query over and over and hitting the query result cache rather than doing any actual searches.

But to answer your question (again): sharding adds overhead. There's no way to make that overhead magically disappear. What you measure is what you can expect, and you must measure.

Best,
Erick

On Tue, Jul 19, 2016 at 8:32 AM, Susheel Kumar wrote:
> You may want to utilise the Document routing (_route_) option to have your
> query served faster.
Re: SolrCloud - Query performance degrades with multiple servers(Shards)
You may want to utilise the Document routing (_route_) option to have your query served faster, but above you are trying to compare apples with oranges, meaning your performance test numbers have to be based either on your actual numbers (like 3-5 million docs per shard) or on enough data to see the advantage of using sharding. 10K is nothing for your performance tests and will not tell you anything.

Otherwise, as Erick mentioned, don't shard; add replicas if there is no need to distribute/divide data into shards.

See
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
https://cwiki.apache.org/confluence/display/solr/Advanced+Distributed+Request+Options

Thanks,
Susheel

On Tue, Jul 19, 2016 at 1:41 AM, kasimjinwala wrote:
> This is just for performance testing; we have taken 10K records per shard.
> In the live scenario it would be 30L-50L per shard.
Re: SolrCloud - Query performance degrades with multiple servers(Shards)
This is just for performance testing; we have taken 10K records per shard. In the live scenario it would be 30L-50L (3-5 million) per shard. I want to search documents from all shards; it slows down and takes too long. I know that in the case of SolrCloud, it will query all shard nodes and then return the result. Is there any way to search documents in all shards with the best performance (qps)?

--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287763.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud - Query performance degrades with multiple servers(Shards)
+1 to Susheel's question. Sharding inevitably adds overhead. Roughly, each shard is queried for its top N docs (10 if, say, rows=10). The doc ID and sort criteria (score by default) are returned to the node that originally got the request. That node then sorts the lists into the real top 10 to return to the user. Then the node handling the request re-queries the shards for the contents of those docs.

Sharding is a way to handle very large data sets; the general recommendation is to shard _only_ when you have too many documents to get good query performance from a single shard. If you need to increase QPS, add _replicas_, not shards. Only go to sharding when you have too many documents to fit on your hardware.

Best,
Erick

On Mon, Jul 18, 2016 at 6:31 AM, Susheel Kumar wrote:
> Hello,
> Question: Do you really need sharding / can you live without sharding,
> since you mentioned only 10K records in one shard? What's your
> index/document size?
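Erick's first phase can be sketched as a merge of each shard's sorted top-N score lists into a global top-N. Plain Java, illustrative only; a real distributed merge sorts by the full sort criteria plus doc ID, not just a score:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ShardMerge {
    // Phase one of a distributed query: each shard returns its own sorted
    // top-N scores; the coordinating node merges them into the global top-N.
    // (Phase two would re-query the shards for the winning docs' contents.)
    static List<Double> mergeTopN(List<List<Double>> shardResults, int n) {
        PriorityQueue<Double> all = new PriorityQueue<>((a, b) -> Double.compare(b, a)); // max-heap
        for (List<Double> shard : shardResults) all.addAll(shard);
        List<Double> top = new ArrayList<>();
        for (int i = 0; i < n && !all.isEmpty(); i++) top.add(all.poll());
        return top;
    }

    public static void main(String[] args) {
        List<List<Double>> shards = List.of(
            List.of(9.0, 7.0, 2.0),   // shard1's top 3
            List.of(8.0, 6.0, 5.0),   // shard2's top 3
            List.of(9.5, 3.0, 1.0));  // shard3's top 3
        System.out.println(mergeTopN(shards, 3)); // [9.5, 9.0, 8.0]
    }
}
```

The overhead Erick mentions is visible in the shape of this sketch: every shard must produce a full top-N even though most of those candidates are discarded, and the coordinator then makes a second round trip for the surviving docs.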
Re: SolrCloud - Query performance degrades with multiple servers(Shards)
Hello,

Question: Do you really need sharding / can you live without sharding, since you mentioned only 10K records in one shard? What's your index/document size?

Thanks,
Susheel

On Mon, Jul 18, 2016 at 2:08 AM, kasimjinwala wrote:
> currently I am using solrCloud 5.0 and I am facing a query performance
> issue while using 3 implicit shards, each shard containing around 10K
> records.
Re: SolrCloud - Query performance degrades with multiple servers(Shards)
Currently I am using SolrCloud 5.0 and I am facing a query performance issue while using 3 implicit shards; each shard contains around 10K records. When I specify the shards parameter (*shards=shard1*) in the query it gives 30K-35K qps, but when I remove the shards parameter from the query it gives *1000-1500 qps*. Performance decreases drastically.

Please provide a comment or suggestion to solve the above issue.

--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287600.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: measuring query performance & qps per node
In SolrCloud you can collect stats on pivot facets, see:
https://issues.apache.org/jira/browse/SOLR-6351

There are more buckets to count into, and in SolrCloud you have extra work to reconcile the partial results from different shards.

Best,
Erick

On Mon, Apr 25, 2016 at 8:50 PM, Jay Potharaju wrote:
> Thanks for the response Erick. I knew that it would depend on the number of
> factors like you mentioned. I just wanted to know whether a good
> combination of queries, facets & filters would be a good estimate of how
> Solr might behave.
>
> What did you mean by "Add stats to pivots in Cloud mode"?
>
> Thanks
>
> On Mon, Apr 25, 2016 at 5:05 PM, Erick Erickson wrote:
>> Impossible to answer. For instance, a facet query can be very
>> heavy-duty. Add stats to pivots in Cloud mode.
>>
>> As for using a bunch of fq clauses, It Depends (tm). If your expected usage
>> pattern is all queries like "q=*:*&fq=clause1&fq=clause2" then it's
>> fine. It totally falls down if, for instance, you have a bunch of facets.
>> Or grouping. Or...
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 25, 2016 at 3:48 PM, Jay Potharaju wrote:
>>> Hi,
>>> I am trying to measure how well our queries are performing, i.e. how long
>>> they are taking. In order to measure query speed I am using SolrMeter with
>>> 50k unique filter queries, and then checking if any of the queries are
>>> slower than 50ms. Is this a good approach to measure query performance?
>>>
>>> Are there any guidelines on how to measure if a given instance can handle
>>> a given number of qps (queries per sec)? For example, if my doc size is 30
>>> million docs and index size is 40 GB of data and the RAM on the instance
>>> is 60 GB, then how many qps can it handle? Or is this a hard question to
>>> answer and it depends on the load and type of query running at a given
>>> time?
>>>
>>> --
>>> Thanks
>>> Jay
>
> --
> Thanks
> Jay Potharaju
Re: measuring query performance & qps per node
Thanks for the response Erick. I knew that it would depend on the number of factors like you mentioned. I just wanted to know whether a good combination of queries, facets & filters would be a good estimate of how Solr might behave.

What did you mean by "Add stats to pivots in Cloud mode"?

Thanks

On Mon, Apr 25, 2016 at 5:05 PM, Erick Erickson wrote:
> Impossible to answer. For instance, a facet query can be very
> heavy-duty. Add stats to pivots in Cloud mode.
>
> As for using a bunch of fq clauses, It Depends (tm). If your expected usage
> pattern is all queries like "q=*:*&fq=clause1&fq=clause2" then it's
> fine. It totally falls down if, for instance, you have a bunch of facets.
> Or grouping. Or...
>
> Best,
> Erick
>
> On Mon, Apr 25, 2016 at 3:48 PM, Jay Potharaju wrote:
>> Hi,
>> I am trying to measure how well our queries are performing, i.e. how long
>> they are taking. In order to measure query speed I am using SolrMeter with
>> 50k unique filter queries, and then checking if any of the queries are
>> slower than 50ms. Is this a good approach to measure query performance?
>>
>> Are there any guidelines on how to measure if a given instance can handle
>> a given number of qps (queries per sec)? For example, if my doc size is 30
>> million docs and index size is 40 GB of data and the RAM on the instance
>> is 60 GB, then how many qps can it handle? Or is this a hard question to
>> answer and it depends on the load and type of query running at a given
>> time?
>>
>> --
>> Thanks
>> Jay

--
Thanks
Jay Potharaju
Re: measuring query performance & qps per node
Impossible to answer. For instance, a facet query can be very heavy-duty. Add stats to pivots in Cloud mode.

As for using a bunch of fq clauses, It Depends (tm). If your expected usage pattern is all queries like "q=*:*&fq=clause1&fq=clause2" then it's fine. It totally falls down if, for instance, you have a bunch of facets. Or grouping. Or...

Best,
Erick

On Mon, Apr 25, 2016 at 3:48 PM, Jay Potharaju wrote:
> Hi,
> I am trying to measure how well our queries are performing, i.e. how long
> they are taking. In order to measure query speed I am using SolrMeter with
> 50k unique filter queries, and then checking if any of the queries are
> slower than 50ms. Is this a good approach to measure query performance?
>
> Are there any guidelines on how to measure if a given instance can handle
> a given number of qps (queries per sec)? For example, if my doc size is 30
> million docs and index size is 40 GB of data and the RAM on the instance
> is 60 GB, then how many qps can it handle? Or is this a hard question to
> answer and it depends on the load and type of query running at a given
> time?
>
> --
> Thanks
> Jay
measuring query performance & qps per node
Hi,
I am trying to measure how well our queries are performing, i.e. how long they are taking. In order to measure query speed I am using SolrMeter with 50k unique filter queries, and then checking if any of the queries are slower than 50ms. Is this a good approach to measure query performance?

Are there any guidelines on how to measure if a given instance can handle a given number of qps (queries per sec)? For example, if my doc size is 30 million docs and index size is 40 GB of data and the RAM on the instance is 60 GB, then how many qps can it handle? Or is this a hard question to answer and it depends on the load and type of query running at a given time?

--
Thanks
Jay
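The "slower than 50 ms" check described above amounts to summarizing the latency samples a load-test tool collects. A minimal sketch, with entirely made-up sample values:

```python
# Sketch: summarizing latency samples collected from a load test
# (e.g. SolrMeter output). The sample values are hypothetical.
latencies_ms = [12, 18, 25, 31, 47, 52, 49, 38, 120, 22]

avg = sum(latencies_ms) / len(latencies_ms)
slow = [t for t in latencies_ms if t > 50]          # queries over the 50 ms budget
slow_pct = 100.0 * len(slow) / len(latencies_ms)

print(f"avg={avg:.1f} ms, {slow_pct:.0f}% over 50 ms")  # avg=41.4 ms, 20% over 50 ms
```

Counting the fraction of slow queries (or a high percentile) is usually more informative than the average alone, since a few outliers like the 120 ms sample above can hide in a healthy-looking mean.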
Re: normal solr query vs facet query performance
On 4/18/2016 5:06 AM, Mugeesh Husain wrote:
> 1.) solr normal query (q=*:*) vs facet query (facet.query="abc")?
> 2.) solr normal query (q=*:*) vs facet search (facet=true&facet.field=column_name)?
> 3.) solr filter query (q=Column:some value) vs facet query (facet.query="abc")?
> 4.) solr normal query (q=*:*) vs filter query (q=column:some value)?

This is a question that is nearly impossible to answer without your actual index, and even then only you can answer it. You need to *try* these queries and see what happens.

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Note that there is a performance bug with the *:* (MatchAllDocs) query on 5.x versions, which is only solved in 5.5.0 and later. This query runs quite a bit slower than it should.

https://issues.apache.org/jira/browse/SOLR-8251

Thanks,
Shawn
normal solr query vs facet query performance
Hello,

I am looking for which query will be faster in terms of performance:

1.) solr normal query (q=*:*) vs facet query (facet.query="abc")?
2.) solr normal query (q=*:*) vs facet search (facet=true&facet.field=column_name)?
3.) solr filter query (q=Column:some value) vs facet query (facet.query="abc")?
4.) solr normal query (q=*:*) vs filter query (q=column:some value)?

Also, please point me to some good tutorials for the above.

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/normal-solr-query-vs-facet-query-performance-tp4270907.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Soft commit does not affecting query performance
Hi Bill,

Please find the reference below:
http://www.cloudera.com/documentation/enterprise/5-4-x/topics/search_tuning_solr.html

* "Enable soft commits and set the value to the largest value that meets your requirements. The default value of 1000 (1 second) is too aggressive for some environments."

Thanks & Regards,
Bhaumik Joshi

________________________________
From: billnb...@gmail.com
Sent: Monday, April 11, 2016 7:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Soft commit does not affecting query performance

Why do you think it would?

Bill Bell
Sent from mobile

> On Apr 11, 2016, at 7:48 AM, Bhaumik Joshi wrote:
>
> Hi All,
>
> We are doing query performance tests with different soft commit intervals.
> In the tests with a 1 sec soft commit interval and a 1 min soft commit
> interval we didn't notice any improvement in query timings.
>
> We tested with SolrMeter (a standalone Java tool for stress tests with
> Solr) for 1 sec soft commit and 1 min soft commit.
>
> Index stats of the test Solr cloud: 0.7 million documents and 1 GB index
> size. The Solr cloud has 2 shards and each shard has one replica.
>
> Please find the detailed test readings below (all timings are in
> milliseconds):
>
> Soft commit - 1 sec
> Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q Time  Total Client time  Avg Client time
>  1           5            100             44340         443         48834              488
>  5           5            101            128914        1276        143239             1418
> 10           5            104            295325        2839        330931             3182
> 25           5            102            675319        6620        793874             7783
>
> Soft commit - 1 min
> Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q Time  Total Client time  Avg Client time
>  1           5            100             44292         442         48569              485
>  5           5            105            131389        1251        147174             1401
> 10           5            102            299518        2936        337748             3311
> 25           5            108            742639        6876        865222             8011
>
> As theory suggests, soft commit affects query performance, but in my case
> it doesn't. Can you shed some light on this?
> Also suggest if I am missing something here.
>
> Regards,
> Bhaumik Joshi
Re: Soft commit does not affecting query performance
Why do you think it would?

Bill Bell
Sent from mobile

> On Apr 11, 2016, at 7:48 AM, Bhaumik Joshi wrote:
>
> Hi All,
>
> We are doing query performance tests with different soft commit intervals.
> In the tests with a 1 sec soft commit interval and a 1 min soft commit
> interval we didn't notice any improvement in query timings.
>
> We tested with SolrMeter (a standalone Java tool for stress tests with
> Solr) for 1 sec soft commit and 1 min soft commit.
>
> Index stats of the test Solr cloud: 0.7 million documents and 1 GB index
> size. The Solr cloud has 2 shards and each shard has one replica.
>
> Please find the detailed test readings below (all timings are in
> milliseconds):
>
> Soft commit - 1 sec
> Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q Time  Total Client time  Avg Client time
>  1           5            100             44340         443         48834              488
>  5           5            101            128914        1276        143239             1418
> 10           5            104            295325        2839        330931             3182
> 25           5            102            675319        6620        793874             7783
>
> Soft commit - 1 min
> Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q Time  Total Client time  Avg Client time
>  1           5            100             44292         442         48569              485
>  5           5            105            131389        1251        147174             1401
> 10           5            102            299518        2936        337748             3311
> 25           5            108            742639        6876        865222             8011
>
> As theory suggests, soft commit affects query performance, but in my case
> it doesn't. Can you shed some light on this?
> Also suggest if I am missing something here.
>
> Regards,
> Bhaumik Joshi
Soft commit does not affecting query performance
Hi All,

We are doing query performance tests with different soft commit intervals. In the tests with a 1 sec soft commit interval and a 1 min soft commit interval we didn't notice any improvement in query timings.

We tested with SolrMeter (a standalone Java tool for stress tests with Solr) for 1 sec soft commit and 1 min soft commit.

Index stats of the test Solr cloud: 0.7 million documents and 1 GB index size. The Solr cloud has 2 shards and each shard has one replica.

Please find the detailed test readings below (all timings are in milliseconds):

Soft commit - 1 sec
Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q Time  Total Client time  Avg Client time
 1           5            100             44340         443         48834              488
 5           5            101            128914        1276        143239             1418
10           5            104            295325        2839        330931             3182
25           5            102            675319        6620        793874             7783

Soft commit - 1 min
Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q Time  Total Client time  Avg Client time
 1           5            100             44292         442         48569              485
 5           5            105            131389        1251        147174             1401
10           5            102            299518        2936        337748             3311
25           5            108            742639        6876        865222             8011

As theory suggests, soft commit affects query performance, but in my case it doesn't. Can you shed some light on this?
Also suggest if I am missing something here.

Regards,
Bhaumik Joshi
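Putting the Avg Q Time columns of the two runs side by side makes the "no improvement" observation concrete; a quick check (numbers taken from the tables above):

```python
# Avg Q Time (ms) per load level, copied from the two test runs above.
one_sec_commit = [443, 1276, 2839, 6620]   # soft commit every 1 sec
one_min_commit = [442, 1251, 2936, 6876]   # soft commit every 1 min

# Relative difference at each load level.
rel_diff = [(a - b) / b for a, b in zip(one_sec_commit, one_min_commit)]
print([f"{d:+.1%}" for d in rel_diff])  # ['+0.2%', '+2.0%', '-3.3%', '-3.7%']
```

Every load level differs by under 5% between the two commit intervals, which is well within run-to-run noise, so the data does support the poster's conclusion that the soft commit interval made no measurable difference here.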
Re: Is it a good query performance with this data size ?
Hi Upayavira,

I happened to compose an individual fq for each field, such as:

fq=Gatewaycode:(...)&fq=DestCode:(...)&fq=DateDep:(...)&fq=Duration:(...)

It is nice to know that I am not creating unnecessary cache entries, since the above method results in minimal cardinality as you pointed out.

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223988.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it a good query performance with this data size ?
Yes, you can limit the size of the filter cache, as Erick says, but then you could just end up with cache churn, where you are constantly re-populating your cache as stuff gets pushed out, only to have to regenerate it again for the next query.

Is it possible to decompose these queries into parts?

fq=+category:sport +year:2015

could be better expressed as:

fq=category:sport
fq=year:2015

Instead of resulting in cardinality(category) * cardinality(year) cache entries, you'd have cardinality(category) + cardinality(year). cardinality() here simply means the number of unique values for that field.

Upayavira

On Wed, Aug 19, 2015, at 05:23 PM, Erick Erickson wrote:
> bq: can I limit the size of the three
> caches so that the RAM usage will be under control
>
> That's exactly what the "size" parameter is for.
>
> As Upayavira says, the rough size of each entry in
> the filterCache is maxDocs/8 + (sizeof query string).
>
> The queryResultCache is much smaller per entry; it's
> roughly (sizeof entire query) + ((sizeof Java int) * queryResultWindowSize).
> queryResultWindowSize is from solrconfig.xml. The point
> here is this is rarely very big unless you make the
> queryResultCache huge.
>
> As for the documentCache, it's also usually not
> very large; it's (the size you declare it) * (average size of a doc).
>
> Best,
> Erick
>
> On Wed, Aug 19, 2015 at 9:12 AM, wwang525 wrote:
>> Hi Upayavira,
>>
>> Thank you very much for pointing out the potential design issue.
>>
>> The queries will be determined through a configuration by business users.
>> There will be a limited number of queries every day, and they will get
>> executed by customers repeatedly. However, business users will change the
>> configurations so that new queries will get generated, which will also be
>> limited. The change can be as frequent as daily or weekly. The project is
>> to support daily promotions based on fresh index data.
>>
>> Cumulatively, there can be a lot of different queries. If I still want to
>> take advantage of the filterCache, can I limit the size of the three
>> caches so that the RAM usage will be under control?
>>
>> Thanks
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223960.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
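Upayavira's cardinality argument is easy to see with numbers. The field cardinalities below are made up purely for illustration:

```python
# Sketch of the cardinality argument above, with hypothetical cardinalities.
categories = 50   # distinct values of category
years = 20        # distinct values of year

# fq=+category:X +year:Y  -> one filterCache entry per (category, year) pair
combined_entries = categories * years

# fq=category:X&fq=year:Y -> one entry per distinct single-field filter
decomposed_entries = categories + years

print(combined_entries, decomposed_entries)  # 1000 70
```

The decomposed form needs over an order of magnitude fewer cache entries here, and each single-field entry is also far more likely to be reused across different queries.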
Re: Is it a good query performance with this data size ?
bq: can I limit the size of the three caches so that the RAM usage will be under control

That's exactly what the "size" parameter is for.

As Upayavira says, the rough size of each entry in the filterCache is maxDocs/8 + (sizeof query string).

The queryResultCache is much smaller per entry; it's roughly (sizeof entire query) + ((sizeof Java int) * queryResultWindowSize). queryResultWindowSize is from solrconfig.xml. The point here is this is rarely very big unless you make the queryResultCache huge.

As for the documentCache, it's also usually not very large; it's (the size you declare it) * (average size of a doc).

Best,
Erick

On Wed, Aug 19, 2015 at 9:12 AM, wwang525 wrote:
> Hi Upayavira,
>
> Thank you very much for pointing out the potential design issue.
>
> The queries will be determined through a configuration by business users.
> There will be a limited number of queries every day, and they will get
> executed by customers repeatedly. However, business users will change the
> configurations so that new queries will get generated, which will also be
> limited. The change can be as frequent as daily or weekly. The project is
> to support daily promotions based on fresh index data.
>
> Cumulatively, there can be a lot of different queries. If I still want to
> take advantage of the filterCache, can I limit the size of the three
> caches so that the RAM usage will be under control?
>
> Thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223960.html
> Sent from the Solr - User mailing list archive at Nabble.com.
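Erick's filterCache rule of thumb lends itself to a back-of-the-envelope calculation. The index size, cache size, and average query length below are assumptions, not figures from this thread:

```python
def filter_cache_bytes(max_doc, num_entries, avg_query_bytes=64):
    """Rough filterCache footprint per Erick's rule of thumb: each entry
    holds a bitset of maxDoc/8 bytes (one bit per document) plus the
    cached filter-query string itself."""
    return num_entries * (max_doc // 8 + avg_query_bytes)

# Hypothetical: an 8M-doc index with a filterCache of size=512.
est = filter_cache_bytes(8_000_000, 512)
print(f"{est / 2**30:.2f} GiB")  # ~0.48 GiB
```

This is why a large filterCache on a big index can quietly consume a substantial slice of the heap even though each individual entry looks modest.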
Re: Is it a good query performance with this data size ?
Hi Upayavira,

Thank you very much for pointing out the potential design issue.

The queries will be determined through a configuration by business users. There will be a limited number of queries every day, and they will get executed by customers repeatedly. However, business users will change the configurations so that new queries will get generated, which will also be limited. The change can be as frequent as daily or weekly. The project is to support daily promotions based on fresh index data.

Cumulatively, there can be a lot of different queries. If I still want to take advantage of the filterCache, can I limit the size of the three caches so that the RAM usage will be under control?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223960.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it a good query performance with this data size ?
You say "all of my queries are based upon fq"? Why? How unique are they?

Remember, for each fq value, Solr can end up storing one bit per document in your index. If you have 8m documents, you could end up with a cache usage of 1MB for that query alone!

Filter queries are primarily designed for queries that are repeated, e.g. category:sport, where caching gives some advantage. If all of your queries are unique, then move them to the q= parameter, or make them fq={!cache=false}; otherwise you will waste memory storing cached values that are never used, and CPU building and then destroying those cached entries.

Upayavira

On Wed, Aug 19, 2015, at 02:25 PM, wwang525 wrote:
> Hi Erick,
>
> All my queries are based on fq (filter query). I have to send the randomly
> generated queries to warm up the low-level Lucene cache.
>
> I went the more tedious way: warming up the low-level cache without
> utilizing the three caches, by turning them off (setting their values to
> zero). Then I sent 800 randomly generated requests to Solr. The RAM jumped
> from 500 MB to 2.5 GB, and stayed there.
>
> Then I tested individual queries against Solr. This time, I got very close
> response times whether I sent the request the first time, second time, or
> third time.
>
> The results:
>
> (1) average response time: 803 ms, with only one request having a response
> time > 1 second (1042 ms)
> (2) the majority of the time was spent on the query, not on faceting
> (730/803 = 90%)
>
> So the query is the bottleneck.
>
> I also have an interesting finding: it looks like the fq query works better
> with integer types. I created string-type versions of two properties,
> DateDep and Duration, since the definition of docValues=true for the
> integer type did not work with faceted search. There was a time I
> accidentally used a filter query with the string-type property, and I found
> the query performance degraded quite a lot.
>
> Is it generally true that fq works better with integer types?
>
> If this is the case, I could create integer-type properties for the two
> other fq clauses to check if I can boost the performance.
>
> Thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223920.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it a good query performance with this data size ?
Hi Erick,

All my queries are based on fq (filter query). I have to send the randomly generated queries to warm up the low-level Lucene cache.

I went the more tedious way: warming up the low-level cache without utilizing the three caches, by turning them off (setting their values to zero). Then I sent 800 randomly generated requests to Solr. The RAM jumped from 500 MB to 2.5 GB, and stayed there.

Then I tested individual queries against Solr. This time, I got very close response times whether I sent the request the first time, second time, or third time.

The results:

(1) average response time: 803 ms, with only one request having a response time > 1 second (1042 ms)
(2) the majority of the time was spent on the query, not on faceting (730/803 = 90%)

So the query is the bottleneck.

I also have an interesting finding: it looks like the fq query works better with integer types. I created string-type versions of two properties, DateDep and Duration, since the definition of docValues=true for the integer type did not work with faceted search. There was a time I accidentally used a filter query with the string-type property, and I found the query performance degraded quite a lot.

Is it generally true that fq works better with integer types?

If this is the case, I could create integer-type properties for the two other fq clauses to check if I can boost the performance.

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223920.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it a good query performance with this data size ?
bq: can I turn off the three caches and send a lot of queries to Solr

I really think you're missing the easiest way to do that. To not put anything in the filter cache, just don't send any fq clauses.

As far as the doc cache is concerned, by and large I just wouldn't worry about it. With MMapDirectory, it's less valuable than it was when it was created. Its primary usage is so that the components in a single query don't have to re-read the docs from disk.

As far as the queryResultCache, by not putting fq clauses on the warmup queries you won't hit this cache next time around.

Best,
Erick

On Tue, Aug 18, 2015 at 1:17 PM, wwang525 wrote:
> Hi Erick,
>
> I just tested 10 different queries with and without the faceting search on
> the two properties: departure_date and hotel_code. Under the cold cache
> scenario, they have pretty much the same response time, and the faceting
> took much less time than the query time. Under the cold cache scenario,
> the "query" (under timing) is still the bottleneck.
>
> I understand that the low-level cache needs to be warmed up to do a more
> realistic test. However, I do not have a good and consistent way to warm
> up the low-level cache without caching the filter queries at the same
> time. If I load test some random queries before I test these 10 individual
> queries, I can see a better response time in some cases, but that could
> also be due to the filter query cache.
>
> To load up the low-level Lucene cache without creating filterCache/document
> cache entries etc., can I turn off the three caches and send a lot of
> queries to Solr before I start to test the performance of each individual
> query?
>
> Thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223758.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it a good query performance with this data size ?
Hi Erick,

I just tested 10 different queries with and without the faceting search on the two properties: departure_date and hotel_code. Under the cold cache scenario, they have pretty much the same response time, and the faceting took much less time than the query time. Under the cold cache scenario, the "query" (under timing) is still the bottleneck.

I understand that the low-level cache needs to be warmed up to do a more realistic test. However, I do not have a good and consistent way to warm up the low-level cache without caching the filter queries at the same time. If I load test some random queries before I test these 10 individual queries, I can see a better response time in some cases, but that could also be due to the filter query cache.

To load up the low-level Lucene cache without creating filterCache/document cache entries etc., can I turn off the three caches and send a lot of queries to Solr before I start to test the performance of each individual query?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223758.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it a good query performance with this data size ?
Those are not that high. I was thinking of facets with thousands to tens of thousands of unique values. I really wouldn't expect this to be a huge hit unless you're querying all docs.

Let us know what you find.

Best,
Erick

On Tue, Aug 18, 2015 at 11:31 AM, wwang525 wrote:
> Hi Erick,
>
> Two facets are probably demanding:
>
> departure_date has 365 distinct values and hotel_code can have 800
> distinct values.
>
> The docValues setting definitely helped me a lot, even when all the
> queries had the above two facets. I will test a list of queries with and
> without the two facets after indexing the data (to take advantage of
> cache warming).
>
> Thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223744.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it a good query performance with this data size ?
Hi Erick,

Two facets are probably demanding:

departure_date has 365 distinct values and hotel_code can have 800 distinct values.

The docValues setting definitely helped me a lot, even when all the queries had the above two facets. I will test a list of queries with and without the two facets after indexing the data (to take advantage of cache warming).

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223744.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it a good query performance with this data size ?
> ...average response time around 1 second.
>
> If I execute a load test again, the average response time will continue to
> drop. However, it stays at about 500 ms per request under this load if I
> try more tests.
>
> These are the best results so far.
>
> I understand that the requests were all different, so this cannot be
> compared with the case where I execute the same query twice (which usually
> gives me a response time around 150 ms).
>
> In the production environment, many requests may be very similar, so the
> filter queries will be executed faster. However, these tests generate all
> random requests, which is different from the production environment.
>
> In addition, the feature of "warming up the cache" may not be applicable
> to my test scenarios due to the randomly generated requests in all tests.
>
> I tried other search solutions, and the performance was not good. That was
> why I tried Solr. Now that I am using Solr, I would like to know, in a
> typical Solr project:
>
> (1) if this is a good response time for this data size without taking too
> much advantage of the cache?
> (2) if it is possible to improve even further without data sharding? For
> example, to get an average response time of less than 200 ms.
>
> Additional information to share:
>
> (1) The tests were done when the Solr instance was not indexing. The CPU
> was dedicated to the test and RAM was sufficient.
>
> (2) Most of the settings in solrconfig.xml are the default. However, the
> cache settings were modified.
> Note: I think the autowarmCount setting may not be very beneficial to my
> tests due to the randomly generated requests. However, I still got a >50%
> hit ratio for filter queries. This is due to the limited values for some
> filter queries.
>
>    <filterCache class="solr.FastLRUCache"
>                 size="4096"
>                 initialSize="1024"
>                 autowarmCount="32"/>
>
>    <queryResultCache class="solr.LRUCache"
>                      size="512"
>                      initialSize="512"
>                      autowarmCount="32"/>
>
>    <documentCache class="solr.LRUCache"
>                   size="1"
>                   initialSize="256"
>                   autowarmCount="0"/>
>
> Thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Is it a good query performance with this data size ?
Hi All, I am working on a search service based on Solr (v5.1.0). The data size is 15 M records. The size of the index files is 860MB. The test was performed on a local machine that has 8 cores with 32 G memory and CPU is 3.4Ghz (Intel Core i7-3770). I found out that setting docValues=true for faceting and grouping indeed boosted the performance with first-time search under cold cache scenario. For example, with our requests that use all the features like grouping, sorting, faceting, I found the difference of faceting alone can be as much as 300 ms. However, response time for the same request executed the second time seems to be at the same level whether the setting of docValues is true or false. Still, I set up docValues=true for all the faceting properties. The following are what I have observed: (1) Test single request one-by-one (no load) With a cold cache, I execute randomly generated queries one after another. The first query routinely exceed 1 second, but not usually more than 2 seconds. I continue to generate random requests, and execute the queries one-by-one, the response time normally stabilized at the range of 500 ms. It does not seem to improve more as I continue execute randomly generated queries. (2) Load test with randomly generated requests Under load test scenario (each core takes 4 requests per second, and continue for 20 round), I can see the CPU usage jumped, and the earlier requests usually got much longer response time, they may even exceed 5 seconds. However, the CPU usage pattern will then changed to the SAW shape, and the response time will drop, and I can see that the requests got executed faster and faster. I usually gets an average response time around 1 second. If I execute a load test again, the average response time will continue drop. However, it stays at about 500 ms/per request under this load if I try more tests. These are the best results so far. 
I understand that the requests are all different, so this cannot be compared with executing the same query twice (which usually gives a response time of around 150 ms). In the production environment, many requests may be similar enough that the filter queries execute faster, whereas these tests generate entirely random requests. For the same reason, cache warming may not be applicable to my test scenarios.

I tried other search solutions and the performance was not good; that is why I tried Solr. Now that I am using Solr, I would like to know, for a typical Solr project: (1) is this a good response time for this data size without taking much advantage of the caches? (2) is it possible to improve further without sharding the data, for example to an average response time of less than 200 ms?

Additional information to share: (1) The tests were done while the Solr instance was not indexing; the CPU was dedicated to the test and RAM was sufficient. (2) Most of the settings in solrconfig.xml are defaults, but the cache settings were modified. Note that the autowarmCount setting may not be very beneficial in my tests because the requests are randomly generated; even so, I still got a >50% hit ratio for filter queries, due to the limited set of values used in some of them.

Thanks
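For reference, enabling docValues for a facet/group/sort field is a schema.xml attribute. The field name and type below are hypothetical, for illustration only:

```xml
<!-- Hypothetical field: docValues="true" builds the column-oriented
     structure that speeds up faceting, grouping and sorting, especially
     on a cold cache. -->
<field name="category" type="string" indexed="true" stored="false"
       docValues="true"/>
```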
Re: Query Performance
I tried using SolrMeter, but for some reason it does not detect my URL and throws a Solr server exception. Sent from my iPhone > On 21-Jul-2015, at 10:58 am, Alessandro Benedetti > wrote: > > SolrMeter mate, > > http://code.google.com/p/solrmeter/ > > Take a look, it will help you a lot ! > > Cheers > > 2015-07-21 16:49 GMT+01:00 Nagasharath : > >> Any recommended tool to test the query performance would be of great help. >> >> Thanks > > > > -- > -- > > Benedetti Alessandro > Visiting card - http://about.me/alessandro_benedetti > Blog - http://alexbenedetti.blogspot.co.uk > > "Tyger, tyger burning bright > In the forests of the night, > What immortal hand or eye > Could frame thy fearful symmetry?" > > William Blake - Songs of Experience -1794 England
Re: Query Performance
SolrMeter mate, http://code.google.com/p/solrmeter/ Take a look, it will help you a lot ! Cheers 2015-07-21 16:49 GMT+01:00 Nagasharath : > Any recommended tool to test the query performance would be of great help. > > Thanks > -- -- Benedetti Alessandro Visiting card - http://about.me/alessandro_benedetti Blog - http://alexbenedetti.blogspot.co.uk "Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry?" William Blake - Songs of Experience -1794 England
Query Performance
Any recommended tool to test the query performance would be of great help. Thanks
Re: SolrCloud delete by query performance
Shawn, thank you very much for that explanation. It helps a lot. Cheers, Ryan On Wed, May 20, 2015 at 5:07 PM, Shawn Heisey wrote: > On 5/20/2015 5:57 PM, Ryan Cutter wrote: > > GC is operating the way I think it should but I am lacking memory. I am > > just surprised because indexing is performing fine (documents going in) > but > > deletions are really bad (documents coming out). > > > > Is it possible these deletes are hitting many segments, each of which I > > assume must be re-built? And if there isn't much slack memory laying > > around to begin with, there's a bunch of contention/swap? > > A deleteByQuery must first query the entire index to determine which IDs > to delete. That's going to hit every segment. In the case of > SolrCloud, it will also hit at least one replica of every single shard > in the collection. > > If the data required to satisfy the query is not already sitting in the > OS disk cache, then the actual disk must be read. When RAM is extremely > tight, any disk operation will erase relevant data out of the OS disk > cache, so the next time it is needed, it must be read off the disk > again. Disks are SLOW. What I am describing is not swap, but the > performance impact is similar to swapping. > > The actual delete operation (once the IDs are known) doesn't touch any > segments ... it writes Lucene document identifiers to a .del file, and > that file is consulted on all queries. Any deleted documents found in > the query results are removed. > > Thanks, > Shawn > >
Re: SolrCloud delete by query performance
On 5/20/2015 5:57 PM, Ryan Cutter wrote: > GC is operating the way I think it should but I am lacking memory. I am > just surprised because indexing is performing fine (documents going in) but > deletions are really bad (documents coming out). > > Is it possible these deletes are hitting many segments, each of which I > assume must be re-built? And if there isn't much slack memory laying > around to begin with, there's a bunch of contention/swap? A deleteByQuery must first query the entire index to determine which IDs to delete. That's going to hit every segment. In the case of SolrCloud, it will also hit at least one replica of every single shard in the collection. If the data required to satisfy the query is not already sitting in the OS disk cache, then the actual disk must be read. When RAM is extremely tight, any disk operation will erase relevant data out of the OS disk cache, so the next time it is needed, it must be read off the disk again. Disks are SLOW. What I am describing is not swap, but the performance impact is similar to swapping. The actual delete operation (once the IDs are known) doesn't touch any segments ... it writes Lucene document identifiers to a .del file, and that file is consulted on all queries. Any deleted documents found in the query results are removed. Thanks, Shawn
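The mechanism Shawn describes can be sketched as a toy model (this is not Lucene's actual code, just an illustration of the two phases): the query phase that touches every segment is the expensive part, while the delete itself only records IDs in a deleted-docs set (the ".del" file) that later queries consult.

```java
import java.util.*;

// Toy model of deleteByQuery: phase 1 queries the whole index for matching
// IDs (the expensive, cache-evicting step); phase 2 just marks those IDs
// deleted without touching segment data; queries then filter against the set.
public class DeleteByQueryModel {

    // Phase 1: scan all docs for those whose "source" field matches.
    static Set<String> findIds(Map<String, String> sourceById, String match) {
        Set<String> ids = new HashSet<>();
        for (Map.Entry<String, String> e : sourceById.entrySet())
            if (e.getValue().equals(match)) ids.add(e.getKey());
        return ids;
    }

    // Query time: segments untouched; deleted docs are filtered out.
    static List<String> liveDocs(Map<String, String> sourceById, Set<String> deleted) {
        List<String> visible = new ArrayList<>();
        for (String id : sourceById.keySet())
            if (!deleted.contains(id)) visible.add(id);
        return visible;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "foo");
        docs.put("2", "bar");
        docs.put("3", "foo");
        Set<String> deleted = findIds(docs, "foo");   // deleteByQuery source:foo
        System.out.println(liveDocs(docs, deleted));  // only doc 2 remains
    }
}
```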
Re: SolrCloud delete by query performance
GC is operating the way I think it should but I am lacking memory. I am just surprised because indexing is performing fine (documents going in) but deletions are really bad (documents coming out). Is it possible these deletes are hitting many segments, each of which I assume must be re-built? And if there isn't much slack memory laying around to begin with, there's a bunch of contention/swap? Thanks Shawn! On Wed, May 20, 2015 at 4:50 PM, Shawn Heisey wrote: > On 5/20/2015 5:41 PM, Ryan Cutter wrote: > > I have a collection with 1 billion documents and I want to delete 500 of > > them. The collection has a dozen shards and a couple replicas. Using > Solr > > 4.4. > > > > Sent the delete query via HTTP: > > > > http://hostname:8983/solr/my_collection/update?stream.body= > > source:foo > > > > Took a couple minutes and several replicas got knocked into Recovery > mode. > > They eventually came back and the desired docs were deleted but the > cluster > > wasn't thrilled (high load, etc). > > > > Is this expected behavior? Is there a better way to delete documents > that > > I'm missing? > > That's the correct way to do the delete. Before you'll see the change, > a commit must happen in one way or another. Hopefully you already knew > that. > > I believe that your setup has some performance issues that are making it > very slow and knocking out your Solr nodes temporarily. > > The most common root problems with SolrCloud and indexes going into > recovery are: 1) Your heap is enormous but your garbage collection is > not tuned. 2) You don't have enough RAM, separate from your Java heap, > for adequate index caching. With a billion documents in your > collection, you might even be having problems with both. > > Here's a wiki page that includes some info on both of these problems, > plus a few others: > > http://wiki.apache.org/solr/SolrPerformanceProblems > > Thanks, > Shawn > >
Re: SolrCloud delete by query performance
On 5/20/2015 5:41 PM, Ryan Cutter wrote: > I have a collection with 1 billion documents and I want to delete 500 of > them. The collection has a dozen shards and a couple replicas. Using Solr > 4.4. > > Sent the delete query via HTTP: > > http://hostname:8983/solr/my_collection/update?stream.body= > source:foo > > Took a couple minutes and several replicas got knocked into Recovery mode. > They eventually came back and the desired docs were deleted but the cluster > wasn't thrilled (high load, etc). > > Is this expected behavior? Is there a better way to delete documents that > I'm missing? That's the correct way to do the delete. Before you'll see the change, a commit must happen in one way or another. Hopefully you already knew that. I believe that your setup has some performance issues that are making it very slow and knocking out your Solr nodes temporarily. The most common root problems with SolrCloud and indexes going into recovery are: 1) Your heap is enormous but your garbage collection is not tuned. 2) You don't have enough RAM, separate from your Java heap, for adequate index caching. With a billion documents in your collection, you might even be having problems with both. Here's a wiki page that includes some info on both of these problems, plus a few others: http://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
SolrCloud delete by query performance
I have a collection with 1 billion documents and I want to delete 500 of them. The collection has a dozen shards and a couple replicas. Using Solr 4.4.

Sent the delete query via HTTP:

http://hostname:8983/solr/my_collection/update?stream.body=<delete><query>source:foo</query></delete>

Took a couple minutes and several replicas got knocked into Recovery mode. They eventually came back and the desired docs were deleted, but the cluster wasn't thrilled (high load, etc).

Is this expected behavior? Is there a better way to delete documents that I'm missing?

Thanks, Ryan
Re: SolrCloud: query performance while indexing
Hi, Will, Have you investigated not using EBS volumes at all? I'm not sure what node size you're using, but for example, you can build a RAID 0 out of the four instance volumes on an m1.xlarge and get lots of disk bandwidth. Also, there's some nice SSD instances available now. http://www.ec2instances.info/ That's assuming disk throughput is your problem. Have you tried using iostat or top to discover what your iowait% is during these merges? Michael Della Bitta Applications Developer o: +1 646 532 3062 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts> w: appinions.com <http://www.appinions.com/> On Thu, Jan 16, 2014 at 3:08 PM, Will Butler wrote: > We currently have a SolrCloud cluster that contains two collections which > we toggle between for querying and indexing. When bulk indexing to our > “offline" collection, our query performance from the “online” collection > suffers somewhat. When segment merges occur, it gets downright abysmal. We > have adjusted several settings that affect flushing and/or merging and have > tried increasing the IOPs capacity of our volumes, without much success. > The best recommendation seems to be to simply have enough ram on each node > for the index to fit into memory (plus additional memory which may be > required for indexing). If this isn’t feasible, it seems that there is no > way around the fact that flushes and merges will potentially take up IO > resources needed for responding to queries. We are currently experimenting > with throttling flushes and merges using maxWriteMBPerSec* settings, which > seems to help if set to fairly low values. Does anyone have any other > recommendations for optimizing SolrCloud to handle both heavy indexing and > querying? > > Thanks, > > Will
SolrCloud: query performance while indexing
We currently have a SolrCloud cluster that contains two collections which we toggle between for querying and indexing. When bulk indexing to our "offline" collection, our query performance from the "online" collection suffers somewhat; when segment merges occur, it gets downright abysmal. We have adjusted several settings that affect flushing and/or merging and have tried increasing the IOPS capacity of our volumes, without much success. The best recommendation seems to be to simply have enough RAM on each node for the index to fit into memory (plus additional memory which may be required for indexing). If this isn't feasible, it seems there is no way around the fact that flushes and merges will potentially take up I/O resources needed for responding to queries. We are currently experimenting with throttling flushes and merges using the maxWriteMBPerSec* settings, which seems to help if they are set to fairly low values. Does anyone have any other recommendations for optimizing SolrCloud to handle both heavy indexing and querying? Thanks, Will
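The maxWriteMBPerSec* experiment mentioned above is typically expressed on the directory factory in solrconfig.xml (Solr 4.x). The values below are illustrative only, not recommendations; tune them against your own iowait measurements and verify the setting names against your Solr version:

```xml
<!-- Illustrative throttle values: cap flush and merge write rates so they
     compete less with query traffic for disk bandwidth. -->
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
  <double name="maxWriteMBPerSecFlush">20.0</double>
  <double name="maxWriteMBPerSecMerge">10.0</double>
</directoryFactory>
```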
Re: Solrj Query Performance
On 11/28/2013 3:01 AM, Ahmet Arslan wrote: > Are you sure you are using the same exact parameters? I would include > echoParams=all and compare parameters. Only wt parameter would be different. > wt=javabin for solrJ You can also look at the Solr log which, if you are logging at the normal level of INFO, will contain all parameters used on each query, and compare the two. There is probably some critical difference. Thanks, Shawn
Re: Solrj Query Performance
Hi Prasi, Are you sure you are using the same exact parameters? I would include echoParams=all and compare the parameters. Only the wt parameter should differ: wt=javabin for SolrJ. On Thursday, November 28, 2013 11:42 AM, Prasi S wrote: Hi, We recently saw a behavior which I wanted to confirm. We are using SolrJ to query Solr; from the code, we use HttpSolrServer to send the query and return the response. 1. When a sample query is sent using SolrJ, we get a QTime of 4 seconds. The same query, sent to Solr in the browser, returns in 50 milliseconds. Initially we thought it was because of caching, but then we tried the reverse way: we sent a new query to Solr in the browser first and got a response in milliseconds; then we used SolrJ and it took 4.5 seconds. (We take the QTime from the response header.) Is this anything to do with SolrJ's internal implementation? Thanks, Prasi
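The comparison suggested above (capture echoParams=all output from the browser and from SolrJ, then diff) can be sketched like this. The parameter values are hypothetical, for illustration; the real input would be the two parameter lists echoed in each response header:

```java
import java.util.*;

// Diff two effective parameter sets, e.g. the params echoed by
// echoParams=all for a browser request vs. a SolrJ request.
public class ParamDiff {

    // Returns key -> "valueA <-> valueB" for every parameter that differs.
    static Map<String, String> diff(Map<String, String> a, Map<String, String> b) {
        Set<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());
        Map<String, String> out = new TreeMap<>();
        for (String k : keys)
            if (!Objects.equals(a.get(k), b.get(k)))
                out.put(k, a.get(k) + " <-> " + b.get(k));
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical captures: only wt should differ if the requests match.
        Map<String, String> browser = Map.of("q", "field:value", "wt", "xml");
        Map<String, String> solrj   = Map.of("q", "field:value", "wt", "javabin");
        System.out.println(diff(browser, solrj));
    }
}
```

Anything besides wt showing up in the diff (an extra default, a different rows or fl) would be the "critical difference" to chase.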
Solrj Query Performance
Hi, We recently saw a behavior which I wanted to confirm. We are using SolrJ to query Solr; from the code, we use HttpSolrServer to send the query and return the response. 1. When a sample query is sent using SolrJ, we get a QTime of 4 seconds. The same query, sent to Solr in the browser, returns in 50 milliseconds. Initially we thought it was because of caching, but then we tried the reverse way: we sent a new query to Solr in the browser first and got a response in milliseconds; then we used SolrJ and it took 4.5 seconds. (We take the QTime from the response header.) Is this anything to do with SolrJ's internal implementation? Thanks, Prasi
Re: Cross index join query performance
Ah, got it now - thanks for the explanation. On Sat, Sep 28, 2013 at 3:33 AM, Upayavira wrote: > The thing here is to understand how a join works. > > Effectively, it does the inner query first, which results in a list of > terms. It then effectively does a multi-term query with those values. > > q=size:large {!join fromIndex=other from=someid > to=someotherid}type:shirt > > Imagine the inner join returned values A,B,C. Your inner query is, on > core 'other', q=type:shirt&fl=someid. > > Then your outer query becomes size:large someotherid:(A B C) > > Your inner query returns 25k values. You're having to do a multi-term > query for 25k terms. That is *bound* to be slow. > > The pseudo-joins in Solr 4.x are intended for a small to medium number > of values returned by the inner query, otherwise performance degrades as > you are seeing. > > Is there a way you can reduce the number of values returned by the inner > query? > > As Joel mentions, those other joins are attempts to find other ways to > work with this limitation. > > Upayavira > > On Fri, Sep 27, 2013, at 09:44 PM, Peter Keegan wrote: > > Hi Joel, > > > > I tried this patch and it is quite a bit faster. Using the same query on > > a > > larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin' > > QTime was 100 msec! This was for true for large and small result sets. > > > > A few notes: the patch didn't compile with 4.3 because of the > > SolrCore.getLatestSchema call (which I worked around), and the package > > name > > should be: > > > class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/> > > > > Unfortunately, I just learned that our uniqueKey may have to be an > > alphanumeric string instead of an int, so I'm not out of the woods yet. > > > > Good stuff - thanks. > > > > Peter > > > > > > On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein > > wrote: > > > > > It looks like you are using int join keys so you may want to check out > > > SOLR-4787, specifically the hjoin and bjoin. 
> > > > > > These perform well when you have a large number of results from the > > > fromIndex. If you have a small number of results in the fromIndex the > > > standard join will be faster. > > > > > > > > > On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan > > >wrote: > > > > > > > I forgot to mention - this is Solr 4.3 > > > > > > > > Peter > > > > > > > > > > > > > > > > On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan < > peterlkee...@gmail.com > > > > >wrote: > > > > > > > > > I'm doing a cross-core join query and the join query is 30X slower > than > > > > > each of the 2 individual queries. Here are the queries: > > > > > > > > > > Main query: > http://localhost:8983/solr/mainindex/select?q=title:java > > > > > QTime: 5 msec > > > > > hit count: 1000 > > > > > > > > > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1TO > > > > 0.3] > > > > > QTime: 4 msec > > > > > hit count: 25K > > > > > > > > > > Join query: > > > > > > > > > > > > > http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindextoIndex=subindexfrom=docidto=docid}fld1:[0.1 > TO 0.3] > > > > > QTime: 160 msec > > > > > hit count: 205 > > > > > > > > > > Here are the index spec's: > > > > > > > > > > mainindex size: 117K docs, 1 segment > > > > > mainindex schema: > > > > > > > > > required="true" multiValued="false" /> > > > > > > > > > stored="true" multiValued="false" /> > > > > >docid > > > > > > > > > > subindex size: 117K docs, 1 segment > > > > > subindex schema: > > > > > > > > > required="true" multiValued="false" /> > > > > > > > > > required="false" multiValued="false" /> > > > > >docid > > > > > > > > > > With debugQuery=true I see: > > > > > "debug":{ > > > > > "join":{ > > > > > "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO > > > 0.3]":{ > > > > > "time":155, > > > > > "fromSetSize":24742, > > > > > "toSetSize":24742, > > > > > "fromTermCount":117810, > > > > > "fromTermTotalDf":117810, > > > > > "fromTermDirectCount":117810, > 
> > > > "fromTermHits":24742, > > > > > "fromTermHitsTotalDf":24742, > > > > > "toTermHits":24742, > > > > > "toTermHitsTotalDf":24742, > > > > > "toTermDirectCount":24627, > > > > > "smallSetsDeferred":115, > > > > > "toSetDocsAdded":24742}}, > > > > > > > > > > Via profiler and debugger, I see 150 msec spent in the outer > > > > > 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This > seems > > > > like a > > > > > lot of time to join the bitsets. Does this seem right? > > > > > > > > > > Peter > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Joel Bernstein > > > Professional Services LucidWorks > > > >
Re: Cross index join query performance
The thing here is to understand how a join works. Effectively, it does the inner query first, which results in a list of terms. It then effectively does a multi-term query with those values. q=size:large {!join fromIndex=other from=someid to=someotherid}type:shirt Imagine the inner join returned values A,B,C. Your inner query is, on core 'other', q=type:shirt&fl=someid. Then your outer query becomes size:large someotherid:(A B C) Your inner query returns 25k values. You're having to do a multi-term query for 25k terms. That is *bound* to be slow. The pseudo-joins in Solr 4.x are intended for a small to medium number of values returned by the inner query, otherwise performance degrades as you are seeing. Is there a way you can reduce the number of values returned by the inner query? As Joel mentions, those other joins are attempts to find other ways to work with this limitation. Upayavira On Fri, Sep 27, 2013, at 09:44 PM, Peter Keegan wrote: > Hi Joel, > > I tried this patch and it is quite a bit faster. Using the same query on > a > larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin' > QTime was 100 msec! This was for true for large and small result sets. > > A few notes: the patch didn't compile with 4.3 because of the > SolrCore.getLatestSchema call (which I worked around), and the package > name > should be: > class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/> > > Unfortunately, I just learned that our uniqueKey may have to be an > alphanumeric string instead of an int, so I'm not out of the woods yet. > > Good stuff - thanks. > > Peter > > > On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein > wrote: > > > It looks like you are using int join keys so you may want to check out > > SOLR-4787, specifically the hjoin and bjoin. > > > > These perform well when you have a large number of results from the > > fromIndex. If you have a small number of results in the fromIndex the > > standard join will be faster. 
> > > > > > On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan > >wrote: > > > > > I forgot to mention - this is Solr 4.3 > > > > > > Peter > > > > > > > > > > > > On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan > > >wrote: > > > > > > > I'm doing a cross-core join query and the join query is 30X slower than > > > > each of the 2 individual queries. Here are the queries: > > > > > > > > Main query: http://localhost:8983/solr/mainindex/select?q=title:java > > > > QTime: 5 msec > > > > hit count: 1000 > > > > > > > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO > > > 0.3] > > > > QTime: 4 msec > > > > hit count: 25K > > > > > > > > Join query: > > > > > > > > > http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindextoIndex=subindexfrom=docid > > to=docid}fld1:[0.1 TO 0.3] > > > > QTime: 160 msec > > > > hit count: 205 > > > > > > > > Here are the index spec's: > > > > > > > > mainindex size: 117K docs, 1 segment > > > > mainindex schema: > > > > > > > required="true" multiValued="false" /> > > > > > > > stored="true" multiValued="false" /> > > > >docid > > > > > > > > subindex size: 117K docs, 1 segment > > > > subindex schema: > > > > > > > required="true" multiValued="false" /> > > > > > > > required="false" multiValued="false" /> > > > >docid > > > > > > > > With debugQuery=true I see: > > > > "debug":{ > > > > "join":{ > > > > "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO > > 0.3]":{ > > > > "time":155, > > > > "fromSetSize":24742, > > > > "toSetSize":24742, > > > > "fromTermCount":117810, > > > > "fromTermTotalDf":117810, > > > > "fromTermDirectCount":117810, > > > > "fromTermHits":24742, > > > > "fromTermHitsTotalDf":24742, > > > > "toTermHits":24742, > > > > "toTermHitsTotalDf":24742, > > > > "toTermDirectCount":24627, > > > > "smallSetsDeferred":115, > > > > "toSetDocsAdded":24742}}, > > > > > > > > Via profiler and debugger, I see 150 msec spent in the outer > > > > 'while(term!=null)' loop in: 
JoinQueryWeight.getDocSet(). This seems > > > like a > > > > lot of time to join the bitsets. Does this seem right? > > > > > > > > Peter > > > > > > > > > > > > > > > > > > > -- > > Joel Bernstein > > Professional Services LucidWorks > >
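The join expansion Upayavira describes can be sketched as a toy model (not Solr's actual implementation): the inner query yields a set of join-key terms, and the outer query then behaves like a multi-term query over those keys, with one postings lookup per key. That per-key cost is why a 25k-term inner result is slow.

```java
import java.util.*;

// Toy model of {!join}: look up outer-core docs for each join key returned
// by the inner query, union them, then intersect with the outer query's hits.
public class JoinModel {

    // outerPostings: join key -> outer-core docs carrying that key.
    static Set<Integer> join(Map<String, Set<Integer>> outerPostings,
                             Set<Integer> outerQueryMatches,
                             Set<String> innerKeys) {
        Set<Integer> joined = new TreeSet<>();
        for (String key : innerKeys) {                  // one lookup per inner key:
            Set<Integer> docs = outerPostings.get(key); // 25k keys => 25k lookups
            if (docs != null) joined.addAll(docs);
        }
        joined.retainAll(outerQueryMatches);            // AND with e.g. size:large
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical data: inner query (type:shirt) returned keys A, B, C.
        Set<String> innerKeys = Set.of("A", "B", "C");
        Map<String, Set<Integer>> postings = Map.of(
            "A", Set.of(1, 2), "B", Set.of(3), "D", Set.of(4));
        Set<Integer> sizeLarge = Set.of(2, 3, 4);       // outer q=size:large
        System.out.println(join(postings, sizeLarge, innerKeys));
    }
}
```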
Re: Cross index join query performance
Hi Joel, I tried this patch and it is quite a bit faster. Using the same query on a larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin' QTime was 100 msec! This was true for both large and small result sets. A few notes: the patch didn't compile with 4.3 because of the SolrCore.getLatestSchema call (which I worked around), and the package name should be: Unfortunately, I just learned that our uniqueKey may have to be an alphanumeric string instead of an int, so I'm not out of the woods yet. Good stuff - thanks. Peter On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein wrote: > It looks like you are using int join keys so you may want to check out > SOLR-4787, specifically the hjoin and bjoin. > > These perform well when you have a large number of results from the > fromIndex. If you have a small number of results in the fromIndex the > standard join will be faster. > > > On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan >wrote: > > > I forgot to mention - this is Solr 4.3 > > > > Peter > > > > > > > > On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan > >wrote: > > > > > I'm doing a cross-core join query and the join query is 30X slower than > > > each of the 2 individual queries.
Here are the queries: > > > > > > Main query: http://localhost:8983/solr/mainindex/select?q=title:java > > > QTime: 5 msec > > > hit count: 1000 > > > > > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO > > 0.3] > > > QTime: 4 msec > > > hit count: 25K > > > > > > Join query: > > > > > > http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindextoIndex=subindexfrom=docid > to=docid}fld1:[0.1 TO 0.3] > > > QTime: 160 msec > > > hit count: 205 > > > > > > Here are the index spec's: > > > > > > mainindex size: 117K docs, 1 segment > > > mainindex schema: > > > > > required="true" multiValued="false" /> > > > > > stored="true" multiValued="false" /> > > >docid > > > > > > subindex size: 117K docs, 1 segment > > > subindex schema: > > > > > required="true" multiValued="false" /> > > > > > required="false" multiValued="false" /> > > >docid > > > > > > With debugQuery=true I see: > > > "debug":{ > > > "join":{ > > > "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO > 0.3]":{ > > > "time":155, > > > "fromSetSize":24742, > > > "toSetSize":24742, > > > "fromTermCount":117810, > > > "fromTermTotalDf":117810, > > > "fromTermDirectCount":117810, > > > "fromTermHits":24742, > > > "fromTermHitsTotalDf":24742, > > > "toTermHits":24742, > > > "toTermHitsTotalDf":24742, > > > "toTermDirectCount":24627, > > > "smallSetsDeferred":115, > > > "toSetDocsAdded":24742}}, > > > > > > Via profiler and debugger, I see 150 msec spent in the outer > > > 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems > > like a > > > lot of time to join the bitsets. Does this seem right? > > > > > > Peter > > > > > > > > > > > > -- > Joel Bernstein > Professional Services LucidWorks >
Re: Cross index join query performance
It looks like you are using int join keys so you may want to check out SOLR-4787, specifically the hjoin and bjoin. These perform well when you have a large number of results from the fromIndex. If you have a small number of results in the fromIndex the standard join will be faster. On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan wrote: > I forgot to mention - this is Solr 4.3 > > Peter > > > > On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan >wrote: > > > I'm doing a cross-core join query and the join query is 30X slower than > > each of the 2 individual queries. Here are the queries: > > > > Main query: http://localhost:8983/solr/mainindex/select?q=title:java > > QTime: 5 msec > > hit count: 1000 > > > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO > 0.3] > > QTime: 4 msec > > hit count: 25K > > > > Join query: > > > http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindextoIndex=subindex > from=docid to=docid}fld1:[0.1 TO 0.3] > > QTime: 160 msec > > hit count: 205 > > > > Here are the index spec's: > > > > mainindex size: 117K docs, 1 segment > > mainindex schema: > > > required="true" multiValued="false" /> > > > stored="true" multiValued="false" /> > >docid > > > > subindex size: 117K docs, 1 segment > > subindex schema: > > > required="true" multiValued="false" /> > > > required="false" multiValued="false" /> > >docid > > > > With debugQuery=true I see: > > "debug":{ > > "join":{ > > "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{ > > "time":155, > > "fromSetSize":24742, > > "toSetSize":24742, > > "fromTermCount":117810, > > "fromTermTotalDf":117810, > > "fromTermDirectCount":117810, > > "fromTermHits":24742, > > "fromTermHitsTotalDf":24742, > > "toTermHits":24742, > > "toTermHitsTotalDf":24742, > > "toTermDirectCount":24627, > > "smallSetsDeferred":115, > > "toSetDocsAdded":24742}}, > > > > Via profiler and debugger, I see 150 msec spent in the outer > > 'while(term!=null)' loop in: 
JoinQueryWeight.getDocSet(). This seems > like a > > lot of time to join the bitsets. Does this seem right? > > > > Peter > > > > > -- Joel Bernstein Professional Services LucidWorks
Re: Cross index join query performance
I forgot to mention - this is Solr 4.3 Peter On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan wrote: > I'm doing a cross-core join query and the join query is 30X slower than > each of the 2 individual queries. Here are the queries: > > Main query: http://localhost:8983/solr/mainindex/select?q=title:java > QTime: 5 msec > hit count: 1000 > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3] > QTime: 4 msec > hit count: 25K > > Join query: > http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindex > toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3] > QTime: 160 msec > hit count: 205 > > Here are the index spec's: > > mainindex size: 117K docs, 1 segment > mainindex schema: > required="true" multiValued="false" /> > stored="true" multiValued="false" /> >docid > > subindex size: 117K docs, 1 segment > subindex schema: > required="true" multiValued="false" /> > required="false" multiValued="false" /> >docid > > With debugQuery=true I see: > "debug":{ > "join":{ > "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{ > "time":155, > "fromSetSize":24742, > "toSetSize":24742, > "fromTermCount":117810, > "fromTermTotalDf":117810, > "fromTermDirectCount":117810, > "fromTermHits":24742, > "fromTermHitsTotalDf":24742, > "toTermHits":24742, > "toTermHitsTotalDf":24742, > "toTermDirectCount":24627, > "smallSetsDeferred":115, > "toSetDocsAdded":24742}}, > > Via profiler and debugger, I see 150 msec spent in the outer > 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems like a > lot of time to join the bitsets. Does this seem right? > > Peter > >
Cross index join query performance
I'm doing a cross-core join query and the join query is 30X slower than each of the 2 individual queries. Here are the queries:

Main query: http://localhost:8983/solr/mainindex/select?q=title:java
QTime: 5 msec
hit count: 1000

Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3]
QTime: 4 msec
hit count: 25K

Join query: http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3]
QTime: 160 msec
hit count: 205

Here are the index spec's:

mainindex size: 117K docs, 1 segment
mainindex schema: docid

subindex size: 117K docs, 1 segment
subindex schema: docid

With debugQuery=true I see:

"debug":{
  "join":{
    "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{
      "time":155,
      "fromSetSize":24742,
      "toSetSize":24742,
      "fromTermCount":117810,
      "fromTermTotalDf":117810,
      "fromTermDirectCount":117810,
      "fromTermHits":24742,
      "fromTermHitsTotalDf":24742,
      "toTermHits":24742,
      "toTermHitsTotalDf":24742,
      "toTermDirectCount":24627,
      "smallSetsDeferred":115,
      "toSetDocsAdded":24742}},

Via profiler and debugger, I see 150 msec spent in the outer 'while(term!=null)' loop in JoinQueryWeight.getDocSet(). This seems like a lot of time to join the bitsets. Does this seem right?

Peter
Re: Solr4 update and query performance question
bq: There is no batching while updating/inserting documents in Solr3

Correct, but all the updates only went to the server you targeted them for. The batching you're seeing is the auto-distribution of the docs to the various shards, a whole different animal.

Keep an eye on: https://issues.apache.org/jira/browse/SOLR-4816. You might prompt Joel to see if this is testable. This JIRA routes the docs directly to the leader of the shard they should go to, IOW it does the routing on the client side. There will still be batching from the leader to the replicas, but this should help.

It is usually a Bad Thing to commit after every batch, in either Solr 3 or Solr 4, from the client. I suspect you're right that the wait for all the searchers on all the shards is one of your problems. Try configuring autocommit (both hard and soft) in solrconfig.xml and forgetting the commit bits from the client. This is the usual pattern in Solr4. Your soft commit (which may be commented out) controls when the documents are searchable. It is less expensive than hard commits with openSearcher=true and makes docs visible. Hard commit closes the current segment and opens a new one. So setting up openSearcher=false for your hard commit, plus a soft commit interval of whatever latency you can stand, would be my recommendation.

Final note: if you set your hard commit with openSearcher=false, do it fairly often, since it truncates the transaction logs and is quite inexpensive. If you let your tlog grow huge and then kill and re-start Solr, you can get into a situation where Solr replays the tlog; if it has a bazillion docs in it, that can take a very long time to start up.

Best,
Erick

On Wed, Aug 14, 2013 at 4:39 PM, Joshi, Shital wrote:
> We didn't copy/paste Solr3 config to solr4. We started with Solr4 config
> and only updated new searcher queries and few other things.
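[Editor's note] Erick's recipe - frequent hard commits with openSearcher=false for tlog truncation, plus a soft commit for visibility - would look roughly like this in solrconfig.xml (the intervals are illustrative placeholders to tune for your own latency tolerance, not prescribed values):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Frequent hard commits truncate the tlog; with openSearcher=false
       they do NOT open a new searcher, so they stay cheap. -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- e.g. every 60 seconds -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commits control when docs become searchable; set to the
       longest visibility latency you can stand. -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>           <!-- e.g. every 10 minutes -->
  </autoSoftCommit>
</updateHandler>
```

With this in place, the client sends plain updates with no commit=true parameter at all.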
RE: Solr4 update and query performance question
We didn't copy/paste the Solr3 config to Solr4. We started with the Solr4 config and only updated the new searcher queries and a few other things.

There is no batching while updating/inserting documents in Solr3, is that correct? Committing 1000 documents in Solr3 takes 19 seconds, while in Solr4 it takes about 3-4 minutes. We noticed in the Solr4 logs that commit only returns after a new searcher is created across all nodes. This is possibly because waitSearcher=true by default in Solr4. This was not the case with Solr3; commit would return without waiting for new searcher creation.

In order to improve performance with Solr4, we first changed from commit=true to commit=false in the update URL and added an autoHardCommit setting in solrconfig.xml. This improved performance from 3-4 minutes to 1-2 minutes, but that is not good enough.

Then we changed the maxBufferedAddsPerServer value in the SolrCmdDistributor class from 10 to 1000, deployed this class in the $JETTY_TEMP_FOLDER/solr-webapp/webapp/WEB-INF/classes folder, and restarted the Solr4 nodes. But we still see the batch size of 10 being used. Did we change the correct variable/class?

Next we will try using softCommit=true in the update URL and check if it gives us the desired performance.

Thanks for looking into this. Appreciate your help.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, August 13, 2013 8:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr4 update and query performance question

1> That's hard-coded at present. There's anecdotal evidence that there
are throughput improvements with larger batch sizes, but no action yet.
Re: Solr4 update and query performance question
1> That's hard-coded at present. There's anecdotal evidence that there are throughput improvements with larger batch sizes, but no action yet.

2> Yep, all searchers are also re-opened, caches re-warmed, etc.

3> Odd. I'm assuming your Solr3 was a master/slave setup? Seeing the queries would help diagnose this. Also, did you try to copy/paste the configuration from your Solr3 to Solr4? I'd start with the Solr4 config and copy/paste only the parts needed from your Solr3 setup.

Best,
Erick

On Mon, Aug 12, 2013 at 11:38 AM, Joshi, Shital wrote:
> Hi,
>
> We have a SolrCloud (4.4.0) cluster (5 shards and 2 replicas) on 10 boxes
> with about 450 mil documents (~90 mil per shard).
Solr4 update and query performance question
Hi,

We have a SolrCloud (4.4.0) cluster (5 shards and 2 replicas) on 10 boxes with about 450 mil documents (~90 mil per shard). We're loading 1000 or fewer documents in CSV format every few minutes. In Solr3, with 300 mil documents, it used to take 30 seconds to load 1000 documents, while in Solr4 it's taking up to 3 minutes to load 1000 documents. We're using custom sharding; we include the _shard_=shardid parameter in the update command. Upon looking at the Solr4 log files we found that:

1. Documents are added in batches of 10 records. How do we increase this batch size from 10 to 1000 documents?

2. We do a hard commit after loading 1000 documents. For every hard commit, it refreshes the searcher on all nodes. Are all caches also refreshed when a hard commit happens? We're planning to change to soft commit and do an auto hard commit every 10-15 minutes.

3. We're not seeing improved query performance compared to Solr3. Queries which took 3-5 seconds in Solr3 (300 mil docs) are taking 20 seconds with Solr4. We think this could be due to the frequent hard commits and searcher refreshes. Do you think that when we change to soft commit and increase the batch size, we will see better query performance?

Thanks!
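[Editor's note] The batch-of-10 behavior in point 1 comes from the receiving node re-routing each doc to its shard leader and flushing a per-leader buffer every maxBufferedAddsPerServer (10) docs. A toy model (plain Python, not Solr's code; the modulo router is a stand-in for Solr's hash-based router) of how one 1000-doc upload fans out:

```python
# Toy model of SolrCloud distributed-update buffering: the node that
# receives a client upload routes each doc to a shard leader and sends
# a request whenever a per-leader buffer reaches maxBufferedAddsPerServer.

def fan_out(doc_ids, num_shards, buffer_size):
    """Count inter-node update requests produced by one client upload."""
    buffers = {s: 0 for s in range(num_shards)}
    requests = 0
    for doc_id in doc_ids:
        shard = doc_id % num_shards        # stand-in for Solr's doc router
        buffers[shard] += 1
        if buffers[shard] == buffer_size:  # buffer full: one request to that leader
            requests += 1
            buffers[shard] = 0
    return requests + sum(1 for n in buffers.values() if n)  # flush remainders

small = fan_out(range(1000), num_shards=5, buffer_size=10)    # many tiny requests
large = fan_out(range(1000), num_shards=5, buffer_size=1000)  # one per leader
```

This is why a 1000-doc CSV load shows up in the logs as ~100 ten-doc adds: the fan-out multiplies per-request overhead roughly 20x compared with one buffered request per leader.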
Re: Query Performance
start is a window into the sorted, matched documents. So, whether the second query matches far fewer documents, and hence has less to sort, depends once again on where X lies in the distribution of documents. If X is the first term in the field, the second query would match all documents (except for the first, since you used "{" rather than "["). But the query itself might be slower than a *:* query, depending on exactly how Lucene evaluates range queries.

-- Jack Krupansky

-----Original Message-----
From: Furkan KAMACI
Sent: Sunday, July 28, 2013 5:34 PM
To: solr-user@lucene.apache.org
Subject: Re: Query Performance

Actually I have to rewrite my question:

Query 1: q=*:*&rows=row_count&sort=id asc&start=X

Query 2: q={X TO *}&rows=row_count&sort=id asc&start=0
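[Editor's note] Jack's two points - start is an offset into the full sorted set, and "{" excludes the boundary value - can be checked with a toy model (plain Python over a hypothetical dense integer id space; real Solr may evaluate the range differently):

```python
# Compare q=*:*&start=X against q=id:{X TO *}&start=0 over sorted ids.
ids = list(range(100))  # stand-in for the sorted id field, ascending
rows, X = 5, 42

# Query 1: match everything, then skip X docs into the sorted set.
page_via_start = ids[X:][:rows]

# Query 2: exclusive range drops every id <= X, then page from the top.
page_via_range = [i for i in ids if i > X][:rows]
```

The two pages are offset by exactly one document: the exclusive "{" means id X itself appears in the first page but never in the second, which is the boundary case Jack calls out.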
Re: Query Performance
Actually I have to rewrite my question:

Query 1: q=*:*&rows=row_count&sort=id asc&start=X

Query 2: q={X TO *}&rows=row_count&sort=id asc&start=0

2013/7/29 Jack Krupansky
> The second query excludes documents matched by [* TO X], while the first
> query matches all documents.
Re: Query Performance
The second query excludes documents matched by [* TO X], while the first query matches all documents.

Relative performance will depend on the relative match count and the sort time on the matched documents. Sorting will likely be the dominant factor - for an equal number of documents. So, it depends on whether starting with X excludes or includes the majority of documents, relative to whatever row_count might be.

Generally, you should only sort a small number of documents/results. Or, consider DocValues, since they are designed for sorting.

-- Jack Krupansky

-----Original Message-----
From: Furkan KAMACI
Sent: Sunday, July 28, 2013 5:06 PM
To: solr-user@lucene.apache.org
Subject: Query Performance

What is the difference between:

q=*:*&rows=row_count&sort=id asc

and

q={X TO *}&rows=row_count&sort=id asc
Query Performance
What is the difference between:

q=*:*&rows=row_count&sort=id asc

and

q={X TO *}&rows=row_count&sort=id asc

Does the first one try to get all the documents but cut the result, or are they the same, or...? What happens in the underlying process of Solr for these two queries?
Re: How to improve the Solr "OR" query performance
Hi,

Does that OR query need to be scored? Does it repeat? If the answers are no and yes, you should use fq, not q.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm

On Wed, Jul 3, 2013 at 12:07 PM, Kevin Osborn wrote:
> Also, what is the total document count for your result set? We have an
> application that is also very slow because it does a lot of OR queries. The
> problem is that the result set is very large because of the ORs. Profiling
> showed that Solr was spending the bulk of its time scoring the documents.
>
> Also, instead of OR, you may want to look at dismax or edismax. For search
> box type applications, OR is not really what you want. It just seems like
> what you want.
>
> -Kevin
>
> On Wed, Jul 3, 2013 at 5:10 AM, Toke Eskildsen wrote:
>> On Wed, 2013-07-03 at 05:48 +0200, huasanyelao wrote:
>>> The response time for the "OR" query is around 1-2 seconds (the "AND"
>>> query is just about 30ms-40ms).
>>
>> The number of hits will also be much lower for the AND-query. To check
>> whether it is the OR or the size of the result set that is the problem,
>> please try and construct an AND-based query that hits about as many
>> documents as your slow OR query.
>>
>> With an index size of just 9GB, I am surprised that you use sharding.
>> Have you tried using just a single instance to avoid the merge overhead?
>>
>> - Toke Eskildsen, State and University Library, Denmark
>
> --
> KEVIN OSBORN
> LEAD SOFTWARE ENGINEER
> CNET Content Solutions
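[Editor's note] Otis's suggestion in request form: move the repeated, unscored OR clause from q to fq, where its DocSet can be cached in the filterCache and reused, instead of being re-scored per matching doc on every request. A sketch with hypothetical field names:

```python
# Build the two request variants; only the placement of the OR clause differs.
from urllib.parse import urlencode

# Scored: the OR clause participates in ranking and is re-evaluated each time.
scored = urlencode({"q": "category:(books OR music OR games)", "rows": 10})

# Filtered: the same restriction as a cacheable, score-free fq.
filtered = urlencode({"q": "*:*",
                      "fq": "category:(books OR music OR games)",
                      "rows": 10})
```

The q string in the filtered variant would be whatever clause actually needs relevance ranking; *:* here just marks that the OR part contributes nothing to scoring.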