Query performance degrades when TLOG replica is replicating

2020-09-03 Thread Ankit Shah
We have the following setup: Solr 7.7.2 with 1 TLOG leader & 1 TLOG
replica on a single shard. We have about 34.5 million documents with an
approximate index size of 600GB. I have noticed degraded query
performance whenever the replica is trying to (guessing here) sync or
perform actual replication. To test this, I fire a very basic query using
the SolrJ client & the query comes back right away, but whenever the
replication process is checking how far behind it is by comparing the
generation ids, the same queries take longer. In production we do not make
these simple queries, but rather complex queries with filter queries &
sorting. Those queries take much longer than they did on our previous
setup (standalone Solr 6.1.0).
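
For reference, the probe query is roughly the following SolrJ call (a minimal
sketch; the base URL and collection name are placeholders, not the real
deployment values). It is the same q=*:*&fl=id&sort=id+desc&rows=1 request
that shows up in the logs below.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ProbeQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/db").build()) {
            SolrQuery q = new SolrQuery("*:*");      // match-all probe query
            q.setFields("id");
            q.setSort("id", SolrQuery.ORDER.desc);
            q.setRows(1);
            QueryResponse rsp = client.query(q);
            // QTime is the server-side time reported in the replica logs below
            System.out.println("hits=" + rsp.getResults().getNumFound()
                    + " QTime=" + rsp.getQTime());
        }
    }
}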

Any help here is appreciated

2020-09-02 16:35:30 INFO  [db_shard1_replica_t3]  webapp=/solr path=/select
params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909
status=0 QTime=0
2020-09-02 16:35:30 INFO  [db_shard1_replica_t3]  webapp=/solr path=/select
params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909
status=0 QTime=0
2020-09-02 16:36:00 INFO  [db_shard1_replica_t3]  webapp=/solr path=/select
params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909
status=0 QTime=0
2020-09-02 16:36:00 INFO  [db_shard1_replica_t3]  webapp=/solr path=/select
params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909
status=0 QTime=0
2020-09-02 16:36:30 INFO  [db_shard1_replica_t3]  webapp=/solr path=/select
params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909
status=0 QTime=0
2020-09-02 16:36:30 INFO  [db_shard1_replica_t3]  webapp=/solr path=/select
params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458909
status=0 QTime=0
*2020-09-02 16:37:01* INFO  [db_shard1_replica_t3]  webapp=/solr
path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2}
hits=34458909 status=0 QTime=*1011*
*2020-09-02 16:37:01* INFO  [db_shard1_replica_t3]  webapp=/solr
path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2}
hits=34458909 status=0 QTime=*758*
*2020-09-02 16:37:32* INFO  [db_shard1_replica_t3]  webapp=/solr
path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2}
hits=34458957 status=0 QTime=*1077*
*2020-09-02 16:37:32* INFO  [db_shard1_replica_t3]  webapp=/solr
path=/select params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2}
hits=34458957 status=0 QTime=*1081*
2020-09-02 16:38:02 INFO  [db_shard1_replica_t3]  webapp=/solr path=/select
params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458957
status=0 QTime=*668*
2020-09-02 16:38:03 INFO  [db_shard1_replica_t3]  webapp=/solr path=/select
params={q=*:*&fl=id&sort=id+desc&rows=1&wt=xml&version=2.2} hits=34458957
status=0 QTime=*1001*


*2020-09-02 16:37:01* INFO  Master's generation: 263116
*2020-09-02 16:37:01* INFO  Master's version: 1599064577322
*2020-09-02 16:37:01* INFO  Slave's generation: 263116
*2020-09-02 16:37:01* INFO  Slave's version: 1599064577322
*2020-09-02 16:37:01* INFO  Slave in sync with master.
2020-09-02 16:37:02 INFO  Master's generation: 104189
2020-09-02 16:37:02 INFO  Master's version: 1599064620532
2020-09-02 16:37:02 INFO  Slave's generation: 104188
2020-09-02 16:37:02 INFO  Slave's version: 1599064560341
2020-09-02 16:37:02 INFO  Starting replication process
2020-09-02 16:37:02 INFO  Number of files in latest index in master: 1010
2020-09-02 16:37:02 INFO  Starting download (fullCopy=false) to
NRTCachingDirectory(MMapDirectory@/opt/solr-7.7.2/server/solr/test_shard1_replica_t3/data/index.20200902163702345
lockFactory=org.apache.lucene.store.NativeFSLockFactory@77247ee;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
2020-09-02 16:37:02 INFO  Bytes downloaded: 837587, Bytes skipped
downloading: 0
2020-09-02 16:37:02 INFO  Total time taken for download
(fullCopy=false,bytesDownloaded=837587) : 0 secs (null bytes/sec) to
NRTCachingDirectory(MMapDirectory@/opt/solr-7.7.2/server/solr/test_shard1_replica_t3/data/index.20200902163702345
lockFactory=org.apache.lucene.store.NativeFSLockFactory@77247ee;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
2020-09-02 16:37:03 INFO  New IndexWriter is ready to be used.
2020-09-02 16:37:03 INFO  Master's generation: 124002
2020-09-02 16:37:03 INFO  Master's version: 1599064617242
2020-09-02 16:37:03 INFO  Slave's generation: 124000
2020-09-02 16:37:03 INFO  Slave's version: 1599064492914
2020-09-02 16:37:03 INFO  Starting replication process
2020-09-02 16:37:04 INFO  [db_shard1_replica_t3]  webapp=/solr path=/update
params={update.distrib=FROMLEADER&distrib.from=
http://178.33.234.1:8983/solr/db_shard1_replica_t25/&wt=javabin&version=2}{add=[11

Re: Facet Query performance

2019-07-08 Thread Shawn Heisey

On 7/8/2019 12:00 PM, Midas A wrote:

Number of Docs: 500,000+ docs
Index Size: 300 GB
RAM: 256 GB
JVM: 32 GB


Half a million documents producing an index size of 300GB suggests 
*very* large documents.  That typically produces an index with fields 
that have very high cardinality, due to text tokenization.


Is Solr the only thing running on this machine, or does it have other 
memory-hungry software running on it?


The screenshot described at the following URL may provide more insight. 
It will be important to get the sort correct.  If the columns have been 
customized to show information other than the examples, it may need to 
be adjusted:


https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue

Assuming that Solr is the only thing on the machine, then it means you 
have about 224 GB of memory available to cache your index data, which is 
at least 300GB.  Normally I would think being able to cache two thirds 
of the index should be enough for good performance, but it's always 
possible that there is something about your setup that means you don't 
have enough memory.


Are you sure that you need a 32GB heap?  Half a million documents should 
NOT require anywhere near that much heap.



Cardinality:
cat=44
rol=1005
ind=504
cl=2000


These cardinality values are VERY low.  If you are certain about those 
numbers, it is not likely that these fields are significant contributors 
to query time, either with or without docValues.  How did you obtain 
those numbers?
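
(If it helps to cross-check those numbers: the unique() aggregation in the
JSON Facet API gives a per-field cardinality estimate. A rough SolrJ sketch,
with the Solr URL and core name as placeholders and the field names taken
from your list:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class FieldCardinality {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/search").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);   // we only want the aggregations, not documents
            q.set("json.facet", "{cat:\"unique(cat)\", rol:\"unique(rol)\", "
                    + "ind:\"unique(ind)\", cl:\"unique(cl)\"}");
            System.out.println(client.query(q).getResponse().get("facets"));
        }
    }
}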


Those are not the only fields referenced in your query.  I also see these:

hemp
cEmp
pEmp
is_udis
id
is_resume
upt_date
country
exp
ctc
contents
currdesig
predesig
lng
ttl
kw_sql
kw_it


QTime:  2988 ms


Three seconds for a query with so many facets is something I would 
probably be pretty happy to get.



Our 35% queries takes more than 10 sec.


I have no idea what this sentence means.

Please suggest ways to improve response time. Attached are the queries,
schema.xml and solrconfig.xml.


1. Are there any other ways to rewrite the queries that would improve our
query performance?


With the information available, the only suggestion I have currently is 
to replace "q=*" with "q=*:*" -- assuming that the intent is to match 
all documents with the main query.  According to what you attached 
(which I am very surprised to see -- attachments usually don't make it 
to the list), your df parameter is "ttl" ... a field that is heavily 
tokenized.  That means that the cardinality of the ttl field is probably 
VERY high, which would make the wildcard query VERY slow.


2. Can we see the docValues cache in the Plugins / Stats -> Cache section of
the Solr admin UI?


The admin UI only shows Solr caches.  If Lucene even has a docValues 
cache (and I do not know whether it does), it will not be available in 
Solr's statistics.  I am unaware of any cache in Solr for docValues. 
The entire point of docValues is to avoid the need to generate and cache 
large amounts of data, so I suspect there is not going to be anything 
available in this regard.


Thanks,
Shawn


Re: Facet Query performance

2019-07-08 Thread Shawn Heisey

On 7/8/2019 3:08 AM, Midas A wrote:

I have enabled docValues on the facet field but the query is still taking time.

How can I improve the query time?
docValues="true" multiValued="true" termVectors="true" /> 


*Query: *




There's very little information here -- only a single field definition 
and the query URL.  No information about how many documents, what sort 
of cardinality there is in the fields being used in the query, no 
information about memory and settings, etc.  You haven't even told us 
how long the query takes.


Your main query is a single * wildcard.  A wildcard query is typically 
quite slow.  If you are aiming for all documents, change that to q=*:* 
instead -- this is special syntax that the query parser understands, and 
is normally executed very quickly.


When a field has DocValues defined, it will automatically be used for 
field-based sorting, field-based facets, and field-based grouping. 
DocValues should not be relied on for queries, because indexed data is 
far faster for that usage.  Queries *can* be done with docValues, but it 
would be VERY slow.  Solr will avoid that usage if it can.


I'm reasonably certain that docValues will NOT be used for facet.query 
as long as the field is indexed.


You do have three field-based facets -- using the facet.field parameter.
 If docValues was present on cat for ALL of the indexing that has 
happened, then they will work for that field, but you have not told us 
whether rol and pref have them defined.
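
A quick way to check is the Schema API, which returns a field's declared
attributes (docValues will show up because it was set explicitly). A rough
SolrJ sketch; the host is a placeholder and "cat" is one of your field names:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;
import org.apache.solr.client.solrj.response.schema.SchemaResponse;

public class CheckDocValues {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/search").build()) {
            SchemaResponse.FieldResponse rsp =
                    new SchemaRequest.Field("cat").process(client);
            // Prints the field's declared attributes; look for docValues=true
            System.out.println(rsp.getField());
        }
    }
}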


You have a lot of faceting in this query.  That can cause things to be slow.

Thanks,
Shawn


Re: Facet Query performance

2019-07-08 Thread Midas A
Hi,
How can I know whether docValues are getting used or not?
Please help me here.

On Mon, Jul 8, 2019 at 2:38 PM Midas A  wrote:

> Hi ,
>
> I have enabled docValues on the facet field but the query is still taking time.
>
> How can I improve the query time?
>  docValues="true" multiValued="true" termVectors="true" /> 
>
> *Query: *
> http://X.X.X.X:
> /solr/search/select?df=ttl&ps=0&hl=true&fl=id,upt&f.ind.mincount=1&hl.usePhraseHighlighter=true&f.pref.mincount=1&q.op=OR&fq=NOT+hemp:(%22xgidx29760%22+%22xmwxmonster%22+%22xmwxmonsterindia%22+%22xmwxcom%22+%22xswxmonster+com%22+%22xswxmonster%22+%22xswxmonsterindia+com%22+%22xswxmonsterindia%22)&fq=NOT+cEmp:(%
> 22nomster.com%22+OR+%22utyu%22)&fq=NOT+pEmp:(%22nomster.com
> %22+OR+%22utyu%22)&fq=ind:(5)&fq=NOT+is_udis:2&fq=NOT+id:(92197+OR+240613+OR+249717+OR+1007148+OR+2500513+OR+2534675+OR+2813498+OR+9401682)&lowercaseOperators=true&ps2=0&bq=is_resume:0^-1000&bq=upt_date:[*+TO+NOW/DAY-36MONTHS]^2&bq=upt_date:[NOW/DAY-36MONTHS+TO+NOW/DAY-24MONTHS]^3&bq=upt_date:[NOW/DAY-24MONTHS+TO+NOW/DAY-12MONTHS]^4&bq=upt_date:[NOW/DAY-12MONTHS+TO+NOW/DAY-9MONTHS]^5&bq=upt_date:[NOW/DAY-9MONTHS+TO+NOW/DAY-6MONTHS]^10&bq=upt_date:[NOW/DAY-6MONTHS+TO+NOW/DAY-3MONTHS]^15&bq=upt_date:[NOW/DAY-3MONTHS+TO+*]^20&bq=NOT+country:isoin^-10&facet.query=exp:[+10+TO+11+]&facet.query=exp:[+11+TO+13+]&facet.query=exp:[+13+TO+15+]&facet.query=exp:[+15+TO+17+]&facet.query=exp:[+17+TO+20+]&facet.query=exp:[+20+TO+25+]&facet.query=exp:[+25+TO+109+]&facet.query=ctc:[+100+TO+101+]&facet.query=ctc:[+101+TO+101.5+]&facet.query=ctc:[+101.5+TO+102+]&facet.query=ctc:[+102+TO+103+]&facet.query=ctc:[+103+TO+104+]&facet.query=ctc:[+104+TO+105+]&facet.query=ctc:[+105+TO+107.5+]&facet.query=ctc:[+107.5+TO+110+]&facet.query=ctc:[+110+TO+115+]&facet.query=ctc:[+115+TO+10100+]&ps3=0&qf=contents^0.05+currdesig^1.5+predesig^1.5+lng^2+ttl+kw_skl+kw_it&f.cl.mincount=1&sow=false&hl.fl=ttl,kw_skl,kw_it,contents&wt=json&f.cat.mincount=1&qs=0&facet.field=ind&facet.field=cat&facet.field=rol&facet.field=cl&facet.field=pref&debug=timing&qt=/resumesearch&f.rol.mincount=1&start=0&rows=40&version=2&q=*&facet.limit=10&pf=id&hl.q=&facet.mincount=1&pf3=id&pf2=id&facet=true&debugQuery=false
>
>


Facet Query performance

2019-07-08 Thread Midas A
Hi ,

I have enabled docValues on the facet field but the query is still taking time.

How can I improve the query time?
 

*Query: *
http://X.X.X.X:
/solr/search/select?df=ttl&ps=0&hl=true&fl=id,upt&f.ind.mincount=1&hl.usePhraseHighlighter=true&f.pref.mincount=1&q.op=OR&fq=NOT+hemp:(%22xgidx29760%22+%22xmwxmonster%22+%22xmwxmonsterindia%22+%22xmwxcom%22+%22xswxmonster+com%22+%22xswxmonster%22+%22xswxmonsterindia+com%22+%22xswxmonsterindia%22)&fq=NOT+cEmp:(%
22nomster.com%22+OR+%22utyu%22)&fq=NOT+pEmp:(%22nomster.com
%22+OR+%22utyu%22)&fq=ind:(5)&fq=NOT+is_udis:2&fq=NOT+id:(92197+OR+240613+OR+249717+OR+1007148+OR+2500513+OR+2534675+OR+2813498+OR+9401682)&lowercaseOperators=true&ps2=0&bq=is_resume:0^-1000&bq=upt_date:[*+TO+NOW/DAY-36MONTHS]^2&bq=upt_date:[NOW/DAY-36MONTHS+TO+NOW/DAY-24MONTHS]^3&bq=upt_date:[NOW/DAY-24MONTHS+TO+NOW/DAY-12MONTHS]^4&bq=upt_date:[NOW/DAY-12MONTHS+TO+NOW/DAY-9MONTHS]^5&bq=upt_date:[NOW/DAY-9MONTHS+TO+NOW/DAY-6MONTHS]^10&bq=upt_date:[NOW/DAY-6MONTHS+TO+NOW/DAY-3MONTHS]^15&bq=upt_date:[NOW/DAY-3MONTHS+TO+*]^20&bq=NOT+country:isoin^-10&facet.query=exp:[+10+TO+11+]&facet.query=exp:[+11+TO+13+]&facet.query=exp:[+13+TO+15+]&facet.query=exp:[+15+TO+17+]&facet.query=exp:[+17+TO+20+]&facet.query=exp:[+20+TO+25+]&facet.query=exp:[+25+TO+109+]&facet.query=ctc:[+100+TO+101+]&facet.query=ctc:[+101+TO+101.5+]&facet.query=ctc:[+101.5+TO+102+]&facet.query=ctc:[+102+TO+103+]&facet.query=ctc:[+103+TO+104+]&facet.query=ctc:[+104+TO+105+]&facet.query=ctc:[+105+TO+107.5+]&facet.query=ctc:[+107.5+TO+110+]&facet.query=ctc:[+110+TO+115+]&facet.query=ctc:[+115+TO+10100+]&ps3=0&qf=contents^0.05+currdesig^1.5+predesig^1.5+lng^2+ttl+kw_skl+kw_it&f.cl.mincount=1&sow=false&hl.fl=ttl,kw_skl,kw_it,contents&wt=json&f.cat.mincount=1&qs=0&facet.field=ind&facet.field=cat&facet.field=rol&facet.field=cl&facet.field=pref&debug=timing&qt=/resumesearch&f.rol.mincount=1&start=0&rows=40&version=2&q=*&facet.limit=10&pf=id&hl.q=&facet.mincount=1&pf3=id&pf2=id&facet=true&debugQuery=false


Re: Optimizing fq query performance

2019-04-18 Thread John Davis
FYI
https://issues.apache.org/jira/browse/SOLR-11437
https://issues.apache.org/jira/browse/SOLR-12488

On Thu, Apr 18, 2019 at 7:24 AM Shawn Heisey  wrote:

> On 4/17/2019 11:49 PM, John Davis wrote:
> > I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO
> > *] doesn't seem materially different compared to has_field:1. If no one
> > knows why Lucene optimizes one but not another, it's not clear whether it
> > even optimizes one to be sure.
>
> Queries using a boolean field will be even faster than the all-inclusive
> range query ... but they require work at index time to function
> properly.  If you can do it this way, that's definitely preferred.  I
> was providing you with something that would work even without the
> separate boolean field.
>
> If the cardinality of the field you're searching is very low (only a few
> possible values for that field across the whole index) then a wildcard
> query can be fast.  It is only when the cardinality is high that the
> wildcard query is slow.  Still, it is better to use the range query for
> determining whether the field exists, unless you have a separate boolean
> field for that purpose, in which case the boolean query will be a little
> bit faster.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-18 Thread Shawn Heisey

On 4/17/2019 11:49 PM, John Davis wrote:

I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO
*] doesn't seem materially different compared to has_field:1. If no one
knows why Lucene optimizes one but not another, it's not clear whether it
even optimizes one to be sure.


Queries using a boolean field will be even faster than the all-inclusive 
range query ... but they require work at index time to function 
properly.  If you can do it this way, that's definitely preferred.  I 
was providing you with something that would work even without the 
separate boolean field.


If the cardinality of the field you're searching is very low (only a few 
possible values for that field across the whole index) then a wildcard 
query can be fast.  It is only when the cardinality is high that the 
wildcard query is slow.  Still, it is better to use the range query for 
determining whether the field exists, unless you have a separate boolean 
field for that purpose, in which case the boolean query will be a little 
bit faster.


Thanks,
Shawn


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
I did a few tests with our instance solr-7.4.0 and field:* vs field:[* TO
*] doesn't seem materially different compared to has_field:1. If no one
knows why Lucene optimizes one but not another, it's not clear whether it
even optimizes one to be sure.

On Wed, Apr 17, 2019 at 4:27 PM Shawn Heisey  wrote:

> On 4/17/2019 1:21 PM, John Davis wrote:
> > If what you describe is the case for range query [* TO *], why would
> lucene
> > not optimize field:* similar way?
>
> I don't know.  Low level lucene operation is a mystery to me.
>
> I have seen first-hand that the range query is MUCH faster than the
> wildcard query.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread Shawn Heisey

On 4/17/2019 1:21 PM, John Davis wrote:

If what you describe is the case for range query [* TO *], why would lucene
not optimize field:* similar way?


I don't know.  Low level lucene operation is a mystery to me.

I have seen first-hand that the range query is MUCH faster than the 
wildcard query.


Thanks,
Shawn


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
If what you describe is the case for range query [* TO *], why would lucene
not optimize field:* similar way?

On Wed, Apr 17, 2019 at 10:36 AM Shawn Heisey  wrote:

> On 4/17/2019 10:51 AM, John Davis wrote:
> > Can you clarify why field:[* TO *] is lot more efficient than field:*
>
> It's a range query.  For every document, Lucene just has to answer two
> questions -- is the value more than any possible value and is the value
> less than any possible value.  The answer will be yes if the field
> exists, and no if it doesn't.  With one million documents, there are two
> million questions that Lucene has to answer.  Which probably seems like
> a lot ... but keep reading.  (Side note:  It wouldn't surprise me if
> Lucene has an optimization specifically for the all inclusive range such
> that it actually only asks one question, not two)
>
> With a wildcard query, there are as many questions as there are values
> in the field.  Every question is asked for every single document.  So if
> you have a million documents and there are three hundred thousand
> different values contained in the field across the whole index, that's
> 300 billion questions.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-17 Thread Shawn Heisey

On 4/17/2019 10:51 AM, John Davis wrote:

Can you clarify why field:[* TO *] is lot more efficient than field:*


It's a range query.  For every document, Lucene just has to answer two 
questions -- is the value more than any possible value and is the value 
less than any possible value.  The answer will be yes if the field 
exists, and no if it doesn't.  With one million documents, there are two 
million questions that Lucene has to answer.  Which probably seems like 
a lot ... but keep reading.  (Side note:  It wouldn't surprise me if 
Lucene has an optimization specifically for the all inclusive range such 
that it actually only asks one question, not two)


With a wildcard query, there are as many questions as there are values 
in the field.  Every question is asked for every single document.  So if 
you have a million documents and there are three hundred thousand 
different values contained in the field across the whole index, that's 
300 billion questions.


Thanks,
Shawn


Re: Optimizing fq query performance

2019-04-17 Thread John Davis
Can you clarify why field:[* TO *] is lot more efficient than field:*

On Sun, Apr 14, 2019 at 12:14 PM Shawn Heisey  wrote:

> On 4/13/2019 12:58 PM, John Davis wrote:
> > We noticed a sizable performance degradation when we add certain fq
> filters
> > to the query even though the result set does not change between the two
> > queries. I would've expected solr to optimize internally by picking the
> > most constrained fq filter first, but maybe my understanding is wrong.
>
> All filters cover the entire index, unless the query parser that you're
> using implements the PostFilter interface, the filter cost is set high
> enough, and caching is disabled.  All three of those conditions must be
> met in order for a filter to only run on results instead of the entire
> index.
>
> http://yonik.com/advanced-filter-caching-in-solr/
> https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
>
> Most query parsers don't implement the PostFilter interface.  The lucene
> and edismax parsers do not implement PostFilter.  Unless you've
> specified the query parser in the fq parameter, it will use the lucene
> query parser, and it cannot be a PostFilter.
>
> > Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
>
> If the point of the "field1:*" query clause is "make sure field1 exists
> in the document" then you would be a lot better off with this query clause:
>
> field1:[* TO *]
>
> This is an all-inclusive range query.  It works with all field types
> where I have tried it, and that includes TextField types.   It will be a
> lot more efficient than the wildcard query.
>
> Here's what happens with "field1:*".  If the cardinality of field1 is
> ten million different values, then the query that gets constructed for
> Lucene will literally contain ten million values.  And every single one
> of them will need to be compared to every document.  That's a LOT of
> comparisons.  Wildcard queries are normally very slow.
>
> Thanks,
> Shawn
>


Re: Optimizing fq query performance

2019-04-14 Thread Shawn Heisey

On 4/13/2019 12:58 PM, John Davis wrote:

We noticed a sizable performance degradation when we add certain fq filters
to the query even though the result set does not change between the two
queries. I would've expected solr to optimize internally by picking the
most constrained fq filter first, but maybe my understanding is wrong.


All filters cover the entire index, unless the query parser that you're 
using implements the PostFilter interface, the filter cost is set high 
enough, and caching is disabled.  All three of those conditions must be 
met in order for a filter to only run on results instead of the entire 
index.


http://yonik.com/advanced-filter-caching-in-solr/
https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/

Most query parsers don't implement the PostFilter interface.  The lucene 
and edismax parsers do not implement PostFilter.  Unless you've 
specified the query parser in the fq parameter, it will use the lucene 
query parser, and it cannot be a PostFilter.
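
If you do need post-filter behavior, one parser that does implement PostFilter
is frange. A sketch (the field here, "price", is just a hypothetical
single-valued numeric field):

import org.apache.solr.client.solrj.SolrQuery;

public class PostFilterSketch {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("some query");
        // cache=false plus cost >= 100 makes this run as a post filter: it is
        // only evaluated against documents that matched everything else.
        q.addFilterQuery("{!frange l=100 cache=false cost=200}price");
        System.out.println(q);
    }
}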



Here's an example:

query1: fq = 'field1:* AND field2:value'
query2: fq = 'field2:value'


If the point of the "field1:*" query clause is "make sure field1 exists 
in the document" then you would be a lot better off with this query clause:


field1:[* TO *]

This is an all-inclusive range query.  It works with all field types 
where I have tried it, and that includes TextField types.   It will be a 
lot more efficient than the wildcard query.
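
Concretely, using the placeholder field names from your example, the rewrite
is just this (sketch):

import org.apache.solr.client.solrj.SolrQuery;

public class RangeVsWildcardFilter {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("*:*");
        // Slower: the wildcard existence check enumerates every term in field1.
        // q.addFilterQuery("field1:* AND field2:value");
        // Faster: the all-inclusive range query expresses the same "field1 exists" intent.
        q.addFilterQuery("field1:[* TO *] AND field2:value");
        System.out.println(q);
    }
}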


Here's what happens with "field1:*".  If the cardinality of field1 is 
ten million different values, then the query that gets constructed for 
Lucene will literally contain ten million values.  And every single one 
of them will need to be compared to every document.  That's a LOT of 
comparisons.  Wildcard queries are normally very slow.


Thanks,
Shawn


Re: Optimizing fq query performance

2019-04-14 Thread Erick Erickson
Patches welcome, but how would that be done? There’s no fixed schema at the 
Lucene level. It’s even possible  that no two documents in the index have any 
fields in common. Given the structure of an inverted index, answering the 
question “for document X does it have any value?” is rather “interesting”. You
might be able to do something with docValues and function queries, but that’s 
overkill.

In some sense, fq=field:* does this dynamically by putting the results in the 
filterCache, where it requires no calculations the next time, so it seems like
more effort than it’s worth.

Best,
Erick

> On Apr 13, 2019, at 11:24 PM, John Davis  wrote:
> 
>> field1:* is slow in general for indexed fields because all terms for the
>> field need to be iterated (e.g. does term1 match doc1, does term2 match
>> doc1, etc)
> 
> This feels like something could be optimized internally by tracking
> existence of the field in a doc instead of making users index yet another
> field to track existence?
> 
> BTW does this same behavior apply for tlong fields too where the value
> might be more continuous vs discrete strings?
> 
> On Sat, Apr 13, 2019 at 12:30 PM Yonik Seeley  wrote:
> 
>> More constrained but matching the same set of documents just guarantees
>> that there is more information to evaluate per document matched.
>> For your specific case, you can optimize fq = 'field1:* AND field2:value'
>> to &fq=field1:*&fq=field2:value
>> This will at least cause field1:* to be cached and reused if it's a common
>> pattern.
>> field1:* is slow in general for indexed fields because all terms for the
>> field need to be iterated (e.g. does term1 match doc1, does term2 match
>> doc1, etc)
>> One can optimize this by indexing a term in a different field to turn it
>> into a single term query (i.e. exists:field1)
>> 
>> -Yonik
>> 
>> On Sat, Apr 13, 2019 at 2:58 PM John Davis 
>> wrote:
>> 
>>> Hi there,
>>> 
>>> We noticed a sizable performance degradation when we add certain fq
>> filters
>>> to the query even though the result set does not change between the two
>>> queries. I would've expected solr to optimize internally by picking the
>>> most constrained fq filter first, but maybe my understanding is wrong.
>>> Here's an example:
>>> 
>>> query1: fq = 'field1:* AND field2:value'
>>> query2: fq = 'field2:value'
>>> 
>>> If we assume that the result set is identical between the two queries and
>>> field1 is in general more frequent in the index, we noticed query1 takes
>>> 100x longer than query2. In case it matters field1 is of type tlongs
>> while
>>> field2 is a string.
>>> 
>>> Any tips for optimizing this?
>>> 
>>> John
>>> 
>> 



Re: Optimizing fq query performance

2019-04-13 Thread John Davis
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)

This feels like something could be optimized internally by tracking
existence of the field in a doc instead of making users index yet another
field to track existence?

BTW does this same behavior apply for tlong fields too where the value
might be more continuous vs discrete strings?

On Sat, Apr 13, 2019 at 12:30 PM Yonik Seeley  wrote:

> More constrained but matching the same set of documents just guarantees
> that there is more information to evaluate per document matched.
> For your specific case, you can optimize fq = 'field1:* AND field2:value'
> to &fq=field1:*&fq=field2:value
> This will at least cause field1:* to be cached and reused if it's a common
> pattern.
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)
> One can optimize this by indexing a term in a different field to turn it
> into a single term query (i.e. exists:field1)
>
> -Yonik
>
> On Sat, Apr 13, 2019 at 2:58 PM John Davis 
> wrote:
>
> > Hi there,
> >
> > We noticed a sizable performance degradation when we add certain fq
> filters
> > to the query even though the result set does not change between the two
> > queries. I would've expected solr to optimize internally by picking the
> > most constrained fq filter first, but maybe my understanding is wrong.
> > Here's an example:
> >
> > query1: fq = 'field1:* AND field2:value'
> > query2: fq = 'field2:value'
> >
> > If we assume that the result set is identical between the two queries and
> > field1 is in general more frequent in the index, we noticed query1 takes
> > 100x longer than query2. In case it matters field1 is of type tlongs
> while
> > field2 is a string.
> >
> > Any tips for optimizing this?
> >
> > John
> >
>


Re: Optimizing fq query performance

2019-04-13 Thread Erick Erickson
Also note that field1:* does not necessarily match all documents. A document 
without that field will not match. So it really can’t be optimized the way you
might expect since, as Yonik says, all the terms have to be enumerated….

Best,
Erick

> On Apr 13, 2019, at 12:30 PM, Yonik Seeley  wrote:
> 
> More constrained but matching the same set of documents just guarantees
> that there is more information to evaluate per document matched.
> For your specific case, you can optimize fq = 'field1:* AND field2:value'
> to &fq=field1:*&fq=field2:value
> This will at least cause field1:* to be cached and reused if it's a common
> pattern.
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)
> One can optimize this by indexing a term in a different field to turn it
> into a single term query (i.e. exists:field1)
> 
> -Yonik
> 
> On Sat, Apr 13, 2019 at 2:58 PM John Davis 
> wrote:
> 
>> Hi there,
>> 
>> We noticed a sizable performance degradation when we add certain fq filters
>> to the query even though the result set does not change between the two
>> queries. I would've expected solr to optimize internally by picking the
>> most constrained fq filter first, but maybe my understanding is wrong.
>> Here's an example:
>> 
>> query1: fq = 'field1:* AND field2:value'
>> query2: fq = 'field2:value'
>> 
>> If we assume that the result set is identical between the two queries and
>> field1 is in general more frequent in the index, we noticed query1 takes
>> 100x longer than query2. In case it matters field1 is of type tlongs while
>> field2 is a string.
>> 
>> Any tips for optimizing this?
>> 
>> John
>> 



Re: Optimizing fq query performance

2019-04-13 Thread Yonik Seeley
More constrained but matching the same set of documents just guarantees
that there is more information to evaluate per document matched.
For your specific case, you can optimize fq = 'field1:* AND field2:value'
to &fq=field1:*&fq=field2:value
This will at least cause field1:* to be cached and reused if it's a common
pattern.
field1:* is slow in general for indexed fields because all terms for the
field need to be iterated (e.g. does term1 match doc1, does term2 match
doc1, etc)
One can optimize this by indexing a term in a different field to turn it
into a single term query (i.e. exists:field1)
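
A minimal sketch of that last idea: maintain an "exists" field at index time
and filter on it with a cheap single-term query. The URL, collection name and
the string field "exists" are assumptions for illustration; field1/field2/value
are the placeholders from the original question.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ExistsMarker {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("field1", 42L);
            doc.addField("field2", "value");
            doc.addField("exists", "field1");   // one term per populated optional field
            client.add(doc);
            client.commit();

            SolrQuery q = new SolrQuery("*:*");
            // Single-term filter instead of field1:*; separate fq clauses are
            // cached (and reused) independently in the filterCache.
            q.addFilterQuery("exists:field1", "field2:value");
            System.out.println(client.query(q).getResults().getNumFound());
        }
    }
}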

-Yonik

On Sat, Apr 13, 2019 at 2:58 PM John Davis 
wrote:

> Hi there,
>
> We noticed a sizable performance degradation when we add certain fq filters
> to the query even though the result set does not change between the two
> queries. I would've expected solr to optimize internally by picking the
> most constrained fq filter first, but maybe my understanding is wrong.
> Here's an example:
>
> query1: fq = 'field1:* AND field2:value'
> query2: fq = 'field2:value'
>
> If we assume that the result set is identical between the two queries and
> field1 is in general more frequent in the index, we noticed query1 takes
> 100x longer than query2. In case it matters field1 is of type tlongs while
> field2 is a string.
>
> Any tips for optimizing this?
>
> John
>


Optimizing fq query performance

2019-04-13 Thread John Davis
Hi there,

We noticed a sizable performance degradation when we add certain fq filters
to the query even though the result set does not change between the two
queries. I would've expected solr to optimize internally by picking the
most constrained fq filter first, but maybe my understanding is wrong.
Here's an example:

query1: fq = 'field1:* AND field2:value'
query2: fq = 'field2:value'

If we assume that the result set is identical between the two queries and
field1 is in general more frequent in the index, we noticed query1 takes
100x longer than query2. In case it matters field1 is of type tlongs while
field2 is a string.

Any tips for optimizing this?

John


Benchmarking Solr Query performance

2018-02-09 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Hi all, 
We would like to perform a benchmark of 
https://issues.apache.org/jira/browse/SOLR-11831
The patch improves the performance of grouped queries asking only for one 
result per group (aka. group.limit=1).
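
For context, the query shape being benchmarked is roughly the following
(SolrJ sketch; the grouping field name is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;

public class GroupLimitOne {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("some query");
        q.set("group", true);
        q.set("group.field", "product_id");   // placeholder grouping field
        q.set("group.limit", 1);              // only the top document per group
        System.out.println(q);
    }
}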

I remember seeing a page showing a benchmark of the query performance on 
Wikipedia, 
Do you know if there is a way in solr to reproduce the same benchmark? Or some 
independent library to do that? 

thanks,
Diego

Re: EXT: Re: Solr Query Performance benchmarking

2017-05-05 Thread Suresh Pendap
Thanks everyone for taking time to respond to my email. I think you are
correct in that the query results might be coming from main memory as I
only had around 7k queries.
However it is still not clear to me, given that everything was being
served from main memory, why it is that I am not able to push the CPU usage
further up by putting more load on the cluster?

Thanks
Suresh

On 4/28/17, 6:44 PM, "Shawn Heisey"  wrote:

>On 4/28/2017 12:43 PM, Toke Eskildsen wrote:
>> Shawn Heisey  wrote:
>>> Adding more shards as Toke suggested *might* help,[...]
>> I seem to have phrased my suggestion poorly. What I meant to suggest
>> was a switch to a single shard (with 4 replicas) setup, instead of the
>> current 2 shards (with 2 replicas).
>
>Reading it a second time, it's me who made the error here.  You did say
>1 shard and 4 replicas, I didn't read it correctly.
>
>Apologies!
>
>Thanks,
>Shawn
>
>



Re: Solr Query Performance benchmarking

2017-04-28 Thread Shawn Heisey
On 4/28/2017 12:43 PM, Toke Eskildsen wrote:
> Shawn Heisey  wrote:
>> Adding more shards as Toke suggested *might* help,[...] 
> I seem to have phrased my suggestion poorly. What I meant to suggest
> was a switch to a single shard (with 4 replicas) setup, instead of the
> current 2 shards (with 2 replicas). 

Reading it a second time, it's me who made the error here.  You did say
1 shard and 4 replicas, I didn't read it correctly.

Apologies!

Thanks,
Shawn



RE: Solr Query Performance benchmarking

2017-04-28 Thread Davis, Daniel (NIH/NLM) [C]
Beautiful, thank you.

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Friday, April 28, 2017 3:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Query Performance benchmarking

I use the JMeter plugins. They’ve been reorganized recently, so they aren’t 
where I originally downloaded them.

Try this:

https://jmeter-plugins.org/wiki/RespTimePercentiles/ 
<https://jmeter-plugins.org/wiki/RespTimePercentiles/>
https://jmeter-plugins.org/wiki/JMeterPluginsCMD/ 
<https://jmeter-plugins.org/wiki/JMeterPluginsCMD/>

Here is the command. It processes the previous JTL output file and puts the 
result in test.csv.

java -Xmx2g -jar CMDRunner.jar --tool Reporter --generate-csv 
${prev_dir}/${test} \
--input-jtl ${prev_dir}/${out} --plugin-type ResponseTimesPercentiles \
>> $logfile 2>&1

The script prints a summary of the run. I need to fix that to also print out 
the header for the columns.

pct25=`grep "^25.0," ${test} | cut -d , -f 2-`
median=`grep "^50.0," ${test} | cut -d , -f 2-`
pct75=`grep "^75.0," ${test} | cut -d , -f 2-`
pct90=`grep "^90.0," ${test} | cut -d , -f 2-`
pct95=`grep "^95.0," ${test} | cut -d , -f 2-`

echo `date` ": 25th percentiles are $pct25"
echo `date` ": medians are $median"
echo `date` ": 75th percentiles are $pct75"
echo `date` ": 90th percentiles are $pct90"
echo `date` ": 95th percentiles are $pct95"
echo `date` ": full results are in ${test}"

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 28, 2017, at 12:00 PM, Davis, Daniel (NIH/NLM) [C] 
>  wrote:
> 
> Walter,
> 
> If you can share a pointer to that JMeter add-on, I'd love it.
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org]
> Sent: Friday, April 28, 2017 2:53 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Query Performance benchmarking
> 
> I use production logs to get a mix of common and long-tail queries. It is 
> very hard to get a realistic distribution with synthetic queries.
> 
> A benchmark run goes like this, with a big shell script driving it.
> 
> 1. Reload the collection to clear caches.
> 2. Split the log into a cache warming set (usually the first 2000 queries) 
> and the rest.
> 3. Run the warming set with four threads and no delay. This gets it done but 
> usually does not overload the server.
> 4. Run the test set with hundreds of threads, each set for a particular rate. 
> The overall config is usually between 2000 and 10,000 requests per minute.
> 5. Tests run for 1-2 hours.
> 6. Grep the results for non-200 responses, filter them out, and report.
> 7. Post process the results to make a CSV file of the percentile response 
> times, one column for each request handler.
> 
> The benchmark driver is a headless JMeter, run with two different config 
> files (warming and test). The post processing is a JMeter add-on.
> 
> If the CPU gets over about 60% or the run queue gets to about the number of 
> processors, the hosts are near congestion. The response time will spike if it 
> is pushed harder than that.
> 
> Prod logs are usually from a few hours of peak traffic during the daytime. 
> This reduces the amount of bot traffic in the logs. I filter out load 
> balancer health checks, Zabbix checks, and so on. I like to get a log of a 
> million queries. That might require grabbing peak traffic logs from several 
> days.
> 
> With the master/slave cluster, I use logs from a single slave. Those will 
> have a lower cache hit rate because the requests are randomly spread out. For 
> our Solr Cloud cluster, I’ve created a prod-size cluster in test. Expensive!
> 
> There is a script in the JMeter config to make /handler and /select?qt=/handler 
> get reported as the same thing. Thank you SolrJ.
> 
> Our SLAs are for 95th percentile.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Apr 28, 2017, at 11:39 AM, Erick Erickson  wrote:
>> 
>> Well, the best way to get no cache hits is to set the cache sizes to 
>> zero ;). That provides worst-case scenarios and tells you exactly how 
>> much you're relying on caches. I'm not talking the lower-level Lucene 
>> caches here.
>> 
>> One thing I've done is use the TermsComponent to generate a list of 
>> terms actually in my corpus, and save them away "somewhere" to 
>> substitute into my queries. The problem with that is when you have 
>> anything except very simple queries involving AND, you generate 
>> unrealistic queries when you substitute in random val

Re: Solr Query Performance benchmarking

2017-04-28 Thread Walter Underwood
I use the JMeter plugins. They’ve been reorganized recently, so they aren’t 
where I originally downloaded them.

Try this:

https://jmeter-plugins.org/wiki/RespTimePercentiles/ 
<https://jmeter-plugins.org/wiki/RespTimePercentiles/>
https://jmeter-plugins.org/wiki/JMeterPluginsCMD/ 
<https://jmeter-plugins.org/wiki/JMeterPluginsCMD/>

Here is the command. It processes the previous JTL output file and puts the 
result in test.csv.

java -Xmx2g -jar CMDRunner.jar --tool Reporter --generate-csv 
${prev_dir}/${test} \
--input-jtl ${prev_dir}/${out} --plugin-type ResponseTimesPercentiles \
>> $logfile 2>&1

The script prints a summary of the run. I need to fix that to also print out 
the header for the columns.

pct25=`grep "^25.0," ${test} | cut -d , -f 2-`
median=`grep "^50.0," ${test} | cut -d , -f 2-`
pct75=`grep "^75.0," ${test} | cut -d , -f 2-`
pct90=`grep "^90.0," ${test} | cut -d , -f 2-`
pct95=`grep "^95.0," ${test} | cut -d , -f 2-`

echo `date` ": 25th percentiles are $pct25"
echo `date` ": medians are $median"
echo `date` ": 75th percentiles are $pct75"
echo `date` ": 90th percentiles are $pct90"
echo `date` ": 95th percentiles are $pct95"
echo `date` ": full results are in ${test}"

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 28, 2017, at 12:00 PM, Davis, Daniel (NIH/NLM) [C] 
>  wrote:
> 
> Walter, 
> 
> If you can share a pointer to that JMeter add-on, I'd love it.
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Friday, April 28, 2017 2:53 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Query Performance benchmarking
> 
> I use production logs to get a mix of common and long-tail queries. It is 
> very hard to get a realistic distribution with synthetic queries.
> 
> A benchmark run goes like this, with a big shell script driving it.
> 
> 1. Reload the collection to clear caches.
> 2. Split the log into a cache warming set (usually the first 2000 queries) 
> and the rest.
> 3. Run the warming set with four threads and no delay. This gets it done but 
> usually does not overload the server.
> 4. Run the test set with hundreds of threads, each set for a particular rate. 
> The overall config is usually between 2000 and 10,000 requests per minute.
> 5. Tests run for 1-2 hours.
> 6. Grep the results for non-200 responses, filter them out, and report.
> 7. Post process the results to make a CSV file of the percentile response 
> times, one column for each request handler.
> 
> The benchmark driver is a headless JMeter, run with two different config 
> files (warming and test). The post processing is a JMeter add-on.
> 
> If the CPU gets over about 60% or the run queue gets to about the number of 
> processors, the hosts are near congestion. The response time will spike if it 
> is pushed harder than that.
> 
> Prod logs are usually from a few hours of peak traffic during the daytime. 
> This reduces the amount of bot traffic in the logs. I filter out load 
> balancer health checks, Zabbix checks, and so on. I like to get a log of a 
> million queries. That might require grabbing peak traffic logs from several 
> days.
> 
> With the master/slave cluster, I use logs from a single slave. Those will 
> have a lower cache hit rate because the requests are randomly spread out. For 
> our Solr Cloud cluster, I’ve created a prod-size cluster in test. Expensive!
> 
> There is a script in the JMeter config to make /handler and /select?qt=/handler 
> get reported as the same thing. Thank you SolrJ.
> 
> Our SLAs are for 95th percentile.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Apr 28, 2017, at 11:39 AM, Erick Erickson  wrote:
>> 
>> Well, the best way to get no cache hits is to set the cache sizes to 
>> zero ;). That provides worst-case scenarios and tells you exactly how 
>> much you're relying on caches. I'm not talking the lower-level Lucene 
>> caches here.
>> 
>> One thing I've done is use the TermsComponent to generate a list of 
>> terms actually in my corpus, and save them away "somewhere" to 
>> substitute into my queries. The problem with that is when you have 
>> anything except very simple queries involving AND, you generate 
>> unrealistic queries when you substitute in random values; you can be 
>> asking for totally unrelated terms and especially on short fields that 
>> leads to lots of 0-hit queries which are also unrealistic.
>> 
>> So you get into a long cycle of generating a b

RE: Solr Query Performance benchmarking

2017-04-28 Thread Davis, Daniel (NIH/NLM) [C]
Walter, 

If you can share a pointer to that JMeter add-on, I'd love it.

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Friday, April 28, 2017 2:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Query Performance benchmarking

I use production logs to get a mix of common and long-tail queries. It is very 
hard to get a realistic distribution with synthetic queries.

A benchmark run goes like this, with a big shell script driving it.

1. Reload the collection to clear caches.
2. Split the log into a cache warming set (usually the first 2000 queries) and 
the rest.
3. Run the warming set with four threads and no delay. This gets it done but 
usually does not overload the server.
4. Run the test set with hundreds of threads, each set for a particular rate. 
The overall config is usually between 2000 and 10,000 requests per minute.
5. Tests run for 1-2 hours.
6. Grep the results for non-200 responses, filter them out, and report.
7. Post process the results to make a CSV file of the percentile response 
times, one column for each request handler.

The benchmark driver is a headless JMeter, run with two different config files 
(warming and test). The post processing is a JMeter add-on.

If the CPU gets over about 60% or the run queue gets to about the number of 
processors, the hosts are near congestion. The response time will spike if it 
is pushed harder than that.

Prod logs are usually from a few hours of peak traffic during the daytime. This 
reduces the amount of bot traffic in the logs. I filter out load balancer 
health checks, Zabbix checks, and so on. I like to get a log of a million 
queries. That might require grabbing peak traffic logs from several days.

With the master/slave cluster, I use logs from a single slave. Those will have 
a lower cache hit rate because the requests are randomly spread out. For our 
Solr Cloud cluster, I’ve created a prod-size cluster in test. Expensive!

There is a script in the JMeter config to make /handler and /select?qt=/handler 
get reported as the same thing. Thank you SolrJ.

Our SLAs are for 95th percentile.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 28, 2017, at 11:39 AM, Erick Erickson  wrote:
> 
> Well, the best way to get no cache hits is to set the cache sizes to 
> zero ;). That provides worst-case scenarios and tells you exactly how 
> much you're relying on caches. I'm not talking the lower-level Lucene 
> caches here.
> 
> One thing I've done is use the TermsComponent to generate a list of 
> terms actually in my corpus, and save them away "somewhere" to 
> substitute into my queries. The problem with that is when you have 
> anything except very simple queries involving AND, you generate 
> unrealistic queries when you substitute in random values; you can be 
> asking for totally unrelated terms and especially on short fields that 
> leads to lots of 0-hit queries which are also unrealistic.
> 
> So you get into a long cycle of generating a bunch of queries and 
> removing all queries with less than N hits when you run them. Then 
> generating more. Then... And each time you pick N, it introduces 
> another layer of not-real-world possibly.
> 
> Sometimes it's the best you can do, but if you can cull real-world 
> applications it's _much_ better. Once you have a bunch (I like 10,000) 
> you can be pretty confident. I not only like to run them randomly, but 
> I also like to sub-divide them into N buckets and then run each bucket 
> in order on the theory that that mimics what users actually did, they 
> don't usually just do stuff at random. Any differences between the 
> random and non-random runs can give interesting information.
> 
> Best,
> Erick
> 
> On Fri, Apr 28, 2017 at 9:38 AM, Rick Leir  wrote:
>> (aside: Using Gatling or Jmeter?)
>> 
>> Question: How can you easily randomize something in the query so you get no 
>> cache hits? I think there are several levels of caching.
>> 
>> --
>> Sorry for being brief. Alternate email is rickleir at yahoo dot com



Re: Solr Query Performance benchmarking

2017-04-28 Thread Walter Underwood
I use production logs to get a mix of common and long-tail queries. It is very 
hard to get a realistic distribution with synthetic queries.

A benchmark run goes like this, with a big shell script driving it.

1. Reload the collection to clear caches.
2. Split the log into a cache warming set (usually the first 2000 queries) and 
the rest.
3. Run the warming set with four threads and no delay. This gets it done but 
usually does not overload the server.
4. Run the test set with hundreds of threads, each set for a particular rate. 
The overall config is usually between 2000 and 10,000 requests per minute.
5. Tests run for 1-2 hours.
6. Grep the results for non-200 responses, filter them out, and report.
7. Post process the results to make a CSV file of the percentile response 
times, one column for each request handler.

The benchmark driver is a headless JMeter, run with two different config files 
(warming and test). The post processing is a JMeter add-on.

If the CPU gets over about 60% or the run queue gets to about the number of 
processors, the hosts are near congestion. The response time will spike if it 
is pushed harder than that.

Prod logs are usually from a few hours of peak traffic during the daytime. This 
reduces the amount of bot traffic in the logs. I filter out load balancer 
health checks, Zabbix checks, and so on. I like to get a log of a million 
queries. That might require grabbing peak traffic logs from several days.

With the master/slave cluster, I use logs from a single slave. Those will have 
a lower cache hit rate because the requests are randomly spread out. For our 
Solr Cloud cluster, I’ve created a prod-size cluster in test. Expensive!

There is a script in the JMeter config to make /handler and /select?qt=/handler 
get reported as the same thing. Thank you SolrJ.

Our SLAs are for 95th percentile.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 28, 2017, at 11:39 AM, Erick Erickson  wrote:
> 
> Well, the best way to get no cache hits is to set the cache sizes to
> zero ;). That provides worst-case scenarios and tells you exactly how
> much you're relying on caches. I'm not talking the lower-level Lucene
> caches here.
> 
> One thing I've done is use the TermsComponent to generate a list of
> terms actually in my corpus, and save them away "somewhere" to
> substitute into my queries. The problem with that is when you have
> anything except very simple queries involving AND, you generate
> unrealistic queries when you substitute in random values; you can be
> asking for totally unrelated terms and especially on short fields that
> leads to lots of 0-hit queries which are also unrealistic.
> 
> So you get into a long cycle of generating a bunch of queries and
> removing all queries with less than N hits when you run them. Then
> generating more. Then... And each time you pick N, it introduces
> another layer of not-real-world possibly.
> 
> Sometimes it's the best you can do, but if you can cull real-world
> applications it's _much_ better. Once you have a bunch (I like 10,000)
> you can be pretty confident. I not only like to run them randomly, but
> I also like to sub-divide them into N buckets and then run each bucket
> in order on the theory that that mimics what users actually did, they
> don't usually just do stuff at random. Any differences between the
> random and non-random runs can give interesting information.
> 
> Best,
> Erick
> 
> On Fri, Apr 28, 2017 at 9:38 AM, Rick Leir  wrote:
>> (aside: Using Gatling or Jmeter?)
>> 
>> Question: How can you easily randomize something in the query so you get no 
>> cache hits? I think there are several levels of caching.
>> 
>> --
>> Sorry for being brief. Alternate email is rickleir at yahoo dot com



Re: Solr Query Performance benchmarking

2017-04-28 Thread Toke Eskildsen
Shawn Heisey  wrote:
> Adding more shards as Toke suggested *might* help,[...]

I seem to have phrased my suggestion poorly. What I meant to suggest was a 
switch to a single shard (with 4 replicas) setup, instead of the current 2 
shards (with 2 replicas).

- Toke


Re: Solr Query Performance benchmarking

2017-04-28 Thread Erick Erickson
Well, the best way to get no cache hits is to set the cache sizes to
zero ;). That provides worst-case scenarios and tells you exactly how
much you're relying on caches. I'm not talking the lower-level Lucene
caches here.

One thing I've done is use the TermsComponent to generate a list of
terms actually in my corpus, and save them away "somewhere" to
substitute into my queries. The problem with that is when you have
anything except very simple queries involving AND, you generate
unrealistic queries when you substitute in random values; you can be
asking for totally unrelated terms and especially on short fields that
leads to lots of 0-hit queries which are also unrealistic.
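
A rough SolrJ sketch of pulling those terms, assuming the stock /terms handler
is available (the URL, collection and field name are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.TermsResponse;

public class DumpTerms {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
            SolrQuery q = new SolrQuery();
            q.setRequestHandler("/terms");
            q.set("terms", true);
            q.set("terms.fl", "contents");     // placeholder field to sample terms from
            q.set("terms.limit", 10000);       // grab plenty of terms to build queries from
            TermsResponse terms = client.query(q).getTermsResponse();
            for (TermsResponse.Term t : terms.getTerms("contents")) {
                System.out.println(t.getTerm() + "\t" + t.getFrequency());
            }
        }
    }
}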

So you get into a long cycle of generating a bunch of queries and
removing all queries with less than N hits when you run them. Then
generating more. Then... And each time you pick N, it introduces
another layer of not-real-world possibly.

Sometimes it's the best you can do, but if you can cull real-world
applications it's _much_ better. Once you have a bunch (I like 10,000)
you can be pretty confident. I not only like to run them randomly, but
I also like to sub-divide them into N buckets and then run each bucket
in order on the theory that that mimics what users actually did, they
don't usually just do stuff at random. Any differences between the
random and non-random runs can give interesting information.

Best,
Erick

On Fri, Apr 28, 2017 at 9:38 AM, Rick Leir  wrote:
> (aside: Using Gatling or Jmeter?)
>
> Question: How can you easily randomize something in the query so you get no 
> cache hits? I think there are several levels of caching.
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Solr Query Performance benchmarking

2017-04-28 Thread Rick Leir
(aside: Using Gatling or Jmeter?)

Question: How can you easily randomize something in the query so you get no 
cache hits? I think there are several levels of caching.

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Solr Query Performance benchmarking

2017-04-28 Thread Erick Erickson
re: the q vs. fq question. My claim (not verified) is that the fastest
of all would be q=*:*&fq={!cache=false}. That would bypass the scoring
that putting it in the "q" clause would entail as well as bypass the
filter cache.

But I have to agree with Walter, this is very suspicious IMO. Here's
what I'd do:

Change my solrconfig to have a cache size so that both
queryResultCache and filterCache that was significantly smaller than
the number of queries I was cycling through for my stress test. If you
really want to have a worst-case scenario, set the sizes to zero. If
that _still_ gives you responses in the 30-40ms range you're in great
shape. I suspect Walter and I would be on the same side of a bet that
this won't be true.
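
For reference, that's just the size attributes on the caches in solrconfig.xml.
A sketch, with the class names and other attributes as in a stock config:

<!-- Worst-case benchmarking: query-related caches effectively disabled -->
<filterCache class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>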

I once worked with a client who was thrilled that their QTimes were
3ms. They were firing the same query over and over, which
reinforces Walter's point.

Best,
Erick

On Fri, Apr 28, 2017 at 7:43 AM, Walter Underwood  wrote:
> More “unrealistic” than “amazing”. I bet the set of test queries is smaller 
> than the query result cache size.
>
> Results from cache are about 2 ms, but network communication to the shards 
> would add enough overhead to reach 40 ms.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Apr 28, 2017, at 5:59 AM, Shawn Heisey  wrote:
>>
>> On 4/27/2017 5:20 PM, Suresh Pendap wrote:
>>> Max throughput that I get: 12000 to 12500 reqs/sec
>>> 95 percentile query latency: 30 to 40 msec
>>
>> These numbers are *amazing* ... far better than I would have expected to
>> see on a 27GB index, even in a situation where it fits entirely into
>> available memory.  I would only expect to see a few hundred requests per
>> second, maybe as much as several hundred.  Congratulations are definitely
>> deserved.
>>
>> Adding more shards as Toke suggested *might* help, but it might also
>> lower performance.  More shards means that a single query from the
>> user's perspective becomes more queries in the background.  Unless you
>> add servers to the cloud to handle the additional shards, more shards
>> will usually slow things down on an index with a high query rate.  On
>> indexes with a very low query rate, more shards on the same hardware is
>> likely to be faster, because there will be plenty of idle CPU capacity.
>>
>> What Toke said about filter queries is right on the money.  Uncached
>> filter queries are pretty expensive.  Once a filter gets cached, it is
>> SUPER fast ... but if you are constantly changing the filter query, then
>> it is unlikely that new filters will be cached.
>>
>> When a particular query does not appear in either the queryResultCache
>> or the filterCache, running it as a clause on the q parameter will
>> usually be faster than running it as an fq parameter.  If that exact
>> query text will be used a LOT, then it makes sense to put it into a
>> filter, where it will become very fast once it is cached.
>>
>> Thanks,
>> Shawn
>>
>


Re: Solr Query Performance benchmarking

2017-04-28 Thread Walter Underwood
More “unrealistic” than “amazing”. I bet the set of test queries is smaller 
than the query result cache size.

Results from cache are about 2 ms, but network communication to the shards 
would add enough overhead to reach 40 ms.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 28, 2017, at 5:59 AM, Shawn Heisey  wrote:
> 
> On 4/27/2017 5:20 PM, Suresh Pendap wrote:
>> Max throughput that I get: 12000 to 12500 reqs/sec
>> 95 percentile query latency: 30 to 40 msec
> 
> These numbers are *amazing* ... far better than I would have expected to
> see on a 27GB index, even in a situation where it fits entirely into
> available memory.  I would only expect to see a few hundred requests per
> second, maybe as much as several hundred.  Congratulations are definitely
> deserved.
> 
> Adding more shards as Toke suggested *might* help, but it might also
> lower performance.  More shards means that a single query from the
> user's perspective becomes more queries in the background.  Unless you
> add servers to the cloud to handle the additional shards, more shards
> will usually slow things down on an index with a high query rate.  On
> indexes with a very low query rate, more shards on the same hardware is
> likely to be faster, because there will be plenty of idle CPU capacity.
> 
> What Toke said about filter queries is right on the money.  Uncached
> filter queries are pretty expensive.  Once a filter gets cached, it is
> SUPER fast ... but if you are constantly changing the filter query, then
> it is unlikely that new filters will be cached.
> 
> When a particular query does not appear in either the queryResultCache
> or the filterCache, running it as a clause on the q parameter will
> usually be faster than running it as an fq parameter.  If that exact
> query text will be used a LOT, then it makes sense to put it into a
> filter, where it will become very fast once it is cached.
> 
> Thanks,
> Shawn
> 



Re: Solr Query Performance benchmarking

2017-04-28 Thread Shawn Heisey
On 4/27/2017 5:20 PM, Suresh Pendap wrote:
> Max throughput that I get: 12000 to 12500 reqs/sec
> 95 percentile query latency: 30 to 40 msec

These numbers are *amazing* ... far better than I would have expected to
see on a 27GB index, even in a situation where it fits entirely into
available memory.  I would only expect to see a few hundred requests per
second, maybe as much as several hundred.  Congratulations are definitely
deserved.

Adding more shards as Toke suggested *might* help, but it might also
lower performance.  More shards means that a single query from the
user's perspective becomes more queries in the background.  Unless you
add servers to the cloud to handle the additional shards, more shards
will usually slow things down on an index with a high query rate.  On
indexes with a very low query rate, more shards on the same hardware is
likely to be faster, because there will be plenty of idle CPU capacity.

What Toke said about filter queries is right on the money.  Uncached
filter queries are pretty expensive.  Once a filter gets cached, it is
SUPER fast ... but if you are constantly changing the filter query, then
it is unlikely that new filters will be cached.

When a particular query does not appear in either the queryResultCache
or the filterCache, running it as a clause on the q parameter will
usually be faster than running it as an fq parameter.  If that exact
query text will be used a LOT, then it makes sense to put it into a
filter, where it will become very fast once it is cached.

Thanks,
Shawn



Re: Solr Query Performance benchmarking

2017-04-28 Thread Toke Eskildsen
On Thu, 2017-04-27 at 23:20 +, Suresh Pendap wrote:
> Number of Solr Nodes: 4
> Number of shards: 2
> replication-factor:  2
> Index size: 55 GB
> Shard/Core size: 27.7 GB
> maxConnsPerHost: 1000

The overhead of sharding is not trivial. Your overall index size is
fairly small, relative to your hardware. As your latency is
(assumedly) fine around 30-40ms and you are chasing query throughput,
you should try switching to 1 shard / 4 replica. It should improve your
throughput and will not hurt latency much (latency might also improve,
but that is more uncertain).
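
If you want to try that layout, the Collections API call for a 1-shard / 4-replica collection looks roughly like the sketch below (host, collection and configset names are made up; reading the response this way needs Java 9+):

import java.io.InputStream;
import java.net.URL;

public class CreateOneShardFourReplicas {
    public static void main(String[] args) throws Exception {
        // Create a 1-shard / 4-replica collection next to the existing 2x2 one.
        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=CREATE&name=orders_1shard"
                + "&collection.configName=orders_conf"
                + "&numShards=1&replicationFactor=4";
        try (InputStream in = new URL(url).openStream()) {
            System.out.write(in.readAllBytes());   // print the API response
            System.out.flush();
        }
    }
}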

> The type of queries are mostly of the below pattern
> q=*:*&fl=orderNo,purchaseOrderNos,timestamp,eventName,eventID,_src_&fq=((orderNo:<value>+AND+purchaseOrderNos:<value>)+OR+(orderNo:<value>))&sort=eventTimestamp+desc&rows=20&wt=javabin&version=2

That seems a bit strange. Why don't you use q instead of fq for the
part of your request that changes?
-- 
Toke Eskildsen, Royal Danish Library


Solr Query Performance benchmarking

2017-04-27 Thread Suresh Pendap
Hi,
I am trying to benchmark Solr query performance and measure the maximum 
throughput and latency that I can get from a given Solr cluster.

Following are my configurations

Number of Solr Nodes: 4
Number of shards: 2
replication-factor:  2
Index size: 55 GB
Shard/Core size: 27.7 GB
maxConnsPerHost: 1000

The Solr nodes are VMs with 16 vCPUs and 112 GB RAM.  The vCPU-to-physical-core 
mapping is 1:1, and the CPU is not overcommitted.

I am generating query load using a Java client program which fires Solr queries 
read from a static file.  The client program uses the Apache HttpClient library 
to invoke the queries, and I have already configured it to allow up to 300 
connections.

The queries are mostly of the below pattern:
q=*:*&fl=orderNo,purchaseOrderNos,timestamp,eventName,eventID,_src_&fq=((orderNo:<value>+AND+purchaseOrderNos:<value>))&sort=eventTimestamp+desc&rows=20&wt=javabin&version=2

Max throughput that I get: 12000 to 12500 reqs/sec
95 percentile query latency: 30 to 40 msec

I am measuring the latency and throughput on the client side in my program.  
The max throughput that I am able to get (the sum of each individual client's 
throughput) is 12000 reqs/sec.  I am running 4 clients, each with 50 threads.  
Even if I increase the number of clients, the throughput stays the same. It 
seems like I am hitting the maximum capacity of the cluster, or some other 
limit that prevents me from putting more stress on the server.
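
For reference, a stripped-down sketch of this kind of load client using Apache HttpClient 4.x with a pooled connection manager (the Solr URL, collection name and query-file format are assumptions):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

public class QueryLoadClient {
    public static void main(String[] args) throws Exception {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(300);            // the 300 max connections mentioned above
        cm.setDefaultMaxPerRoute(300);  // all requests go to a single Solr endpoint
        CloseableHttpClient http = HttpClients.custom().setConnectionManager(cm).build();

        List<String> queries = Files.readAllLines(Paths.get("queries.txt"),
                StandardCharsets.UTF_8);
        LongAdder done = new LongAdder();

        ExecutorService pool = Executors.newFixedThreadPool(50);  // 50 threads per client
        long start = System.nanoTime();
        for (String q : queries) {
            pool.submit(() -> {
                try (CloseableHttpResponse rsp = http.execute(new HttpGet(
                        "http://solr-host:8983/solr/orders/select?" + q))) {
                    EntityUtils.consume(rsp.getEntity());  // drain so the connection is reused
                    done.increment();
                } catch (Exception ignored) {
                    // a real client would record errors
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d queries in %.1f s = %.0f req/s%n",
                done.sum(), secs, done.sum() / secs);
    }
}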

My CPU is hitting 60% to 70%.  I have not been able to increase the CPU usage 
more than this even when increasing client threads or generating load with more 
client nodes.

The memory used is around 16% on all the nodes except on one node I am seeing 
the memory used is 41%.

There is hardly any IO happening as it is a read test.

I am wondering what is limiting my throughput. Is there some internal thread 
pool limit that I am hitting that prevents me from increasing my CPU/memory 
usage?

My JVM settings are provided below. I am using G1GC with the following flags:


-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=13001
-Dcom.sun.management.jmxremote.rmi.port=13001
-Dcom.sun.management.jmxremote.ssl=false
-Djetty.home=/app/solr6/server
-Djetty.port=8983
-Dlog4j.configuration=file:
-Dsolr.autoSoftCommit.maxTime=5000
-Dsolr.autoSoftCommit.minTime=5000
-Dsolr.install.dir=/app/solr6
-Dsolr.log.dir=/app/solrdata6/logs
-Dsolr.log.muteconsole
-Dsolr.solr.home=
-Duser.timezone=UTC
-DzkClientTimeout=15000
-DzkHost=
-XX:+AlwaysPreTouch
-XX:+ResizeTLAB
-XX:+UseG1GC
-XX:+UseGCLogFileRotation
-XX:+UseLargePages
-XX:+UseTLAB
-XX:-UseBiasedLocking
-XX:GCLogFileSize=20M
-XX:MaxGCPauseMillis=50
-XX:NumberOfGCLogFiles=9
-XX:OnOutOfMemoryError=/app/solr6/bin/oom_solr.sh
-Xloggc:
-Xms11g
-Xmx11g
-Xss256k
-verbose:gc


I have not customized the Solr cache values; the documentCache, 
queryResultCache, and fieldValueCache are all using default values.  I read in 
one of the Solr performance documents that it is better to leave more memory to 
the operating system and make use of the OS page cache.

Is this the best query throughput that I can extract from a cluster and index 
of this size?

Any ideas are highly appreciated.

Thanks
Suresh


RE: DataImportHandler | Query | performance

2016-12-23 Thread Prateek Jain J

Thanks a lot Shawn.


Regards,
Prateek Jain

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: 23 December 2016 01:36 PM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler | Query | performance

On 12/23/2016 5:15 AM, Prateek Jain J wrote:
> We need some advice/views on the way we push our documents in SOLR (4.8.1). 
> So, here are the requirements:
>
> 1.   Document could be from 5 to 100 KB in size.
>
> 2.   10-50 users actively querying solr with different sort of data.
>
> 3.   Data will be available frequently to be pushed to solr (streaming). 
> It must be available with-in 15 seconds to be queried.
>
> Current scenario:
>   We dump data to a json file and have a cron job (in java, each time a 
> new file is created) which reads this file periodically and sends it to SOLR 
> using solrj (via http). This file is massive and could be of size ~GBs in 
> some cases (soft and hard solr commits are configured appropriately).
>
> Issue:
>
> 1.   Multiple cores exist in this SOLR and they too follow similar 
> pattern.
>
> 2.   This causes SOLR to hang and cause OOM in some cases due to too 
> many file descriptors opened (sometimes due to other issues)
>
> We would like to know if using DataImportHandler give us any advantage? I 
> just gave a quick glance on Solr Wiki but not clear if it offers any 
> advantages in terms of performance (in this scenario).

If you do find a way to do this with DIH, it might make your "too many open 
files" problems *worse*, not better.  Currently these files you are talking 
about are being handled by a completely separate process, not Solr.  If you 
move this inside Solr, then Solr will open *more* files.

Your SolrJ program should read the files and construct SolrInputDocument 
objects, then send them in batches to Solr.  It should not send massive files 
directly.  That might fix the OOM issues, or it might not -- if not, then your 
Solr machine needs a larger heap.  To deal with the open files problem, you're 
going to have to fiddle with the operating system to allow it to open more 
files.

DIH has limitations that frequently make it necessary for users to write their 
own programs to do indexing.  Since you already have an external process, you 
should improve that, rather than trying to use DIH.

Thanks,
Shawn



Re: DataImportHandler | Query | performance

2016-12-23 Thread Shawn Heisey
On 12/23/2016 5:15 AM, Prateek Jain J wrote:
> We need some advice/views on the way we push our documents in SOLR (4.8.1). 
> So, here are the requirements:
>
> 1.   Document could be from 5 to 100 KB in size.
>
> 2.   10-50 users actively querying solr with different sort of data.
>
> 3.   Data will be available frequently to be pushed to solr (streaming). 
> It must be available with-in 15 seconds to be queried.
>
> Current scenario:
>   We dump data to a json file and have a cron job (in java, each time a 
> new file is created) which reads this file periodically and sends it to SOLR 
> using solrj (via http). This file is massive and could be of size ~GBs in 
> some cases (soft and hard solr commits are configured appropriately).
>
> Issue:
>
> 1.   Multiple cores exist in this SOLR and they too follow similar 
> pattern.
>
> 2.   This causes SOLR to hang and cause OOM in some cases due to too
> many file descriptors opened (sometimes due to other issues)
>
> We would like to know if using DataImportHandler give us any advantage? I 
> just gave a quick glance on Solr Wiki but not clear if it offers any 
> advantages in terms of performance (in this scenario).

If you do find a way to do this with DIH, it might make your "too many
open files" problems *worse*, not better.  Currently these files you are
talking about are being handled by a completely separate process, not
Solr.  If you move this inside Solr, then Solr will open *more* files.

Your SolrJ program should read the files and construct SolrInputDocument
objects, then send them in batches to Solr.  It should not send massive
files directly.  That might fix the OOM issues, or it might not -- if
not, then your Solr machine needs a larger heap.  To deal with the open
files problem, you're going to have to fiddle with the operating system
to allow it to open more files.
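
A minimal sketch of that approach (shown with a recent SolrJ HttpSolrClient.Builder API; on the 4.x line the client class differs, and the file format, field names and batch size here are assumptions):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/mycore").build();
             BufferedReader in = Files.newBufferedReader(Paths.get("dump.jsonl"))) {

            List<SolrInputDocument> batch = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", extractId(line));   // hypothetical parsing
                doc.addField("body", line);
                batch.add(doc);
                if (batch.size() == 1000) {            // keep each request small
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            // Visibility is handled by autoCommit/autoSoftCommit in
            // solrconfig.xml rather than an explicit commit per batch.
        }
    }

    private static String extractId(String jsonLine) {
        // Placeholder: real code would parse the JSON record for its id field.
        return Integer.toHexString(jsonLine.hashCode());
    }
}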

DIH has limitations that frequently make it necessary for users to write
their own programs to do indexing.  Since you already have an external
process, you should improve that, rather than trying to use DIH.

Thanks,
Shawn



DataImportHandler | Query | performance

2016-12-23 Thread Prateek Jain J

Hi All,

We need some advice/views on the way we push our documents into Solr (4.8.1). 
Here are the requirements:


1.   Documents could be from 5 to 100 KB in size.

2.   10-50 users actively querying Solr with different sorts of data.

3.   Data will be available frequently to be pushed to Solr (streaming). It 
must be available within 15 seconds to be queried.

Current scenario:
  We dump data to a JSON file and have a cron job (in Java) which runs 
periodically, reads each newly created file, and sends it to Solr using SolrJ 
(via HTTP). This file is massive and can be several GB in some cases (soft and 
hard Solr commits are configured appropriately).

Issue:

1.   Multiple cores exist in this Solr instance, and they too follow a similar pattern.

2.   This causes SOLR to hang and cause OOM in some cases due to too many 
file descriptors opened (sometimes due to other issues)

We would like to know whether using DataImportHandler would give us any 
advantage. I only took a quick glance at the Solr wiki, and it is not clear 
whether it offers any advantage in terms of performance in this scenario.


Regards,
Prateek Jain



Re: facet query performance

2016-11-14 Thread Toke Eskildsen
On Mon, 2016-11-14 at 11:36 +0530, Midas A wrote:
> How to improve facet query performance

1) Don't shard unless you really need to. Replicas are fine.

2) If the problem is the first facet call, then enable DocValues and
re-index.

3) Keep facet.limit <= 100, especially if you shard.

and most important

4) Describe in detail what you have, how you facet and what you expect.
Give us something to work with.
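
For illustration, a request following points 2 and 3 (shown with a recent SolrJ API; the collection and field names are made up, and the faceted field is assumed to have docValues="true" in the schema):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);               // facets only, skip document retrieval
            q.setFacet(true);
            q.addFacetField("brand");   // assumed docValues-enabled field
            q.setFacetLimit(100);       // keep facet.limit <= 100, per point 3
            q.setFacetMinCount(1);
            QueryResponse rsp = solr.query(q);
            rsp.getFacetField("brand").getValues().forEach(
                    c -> System.out.println(c.getName() + " -> " + c.getCount()));
        }
    }
}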


- Toke Eskildsen, State and University Library, Denmark


facet query performance

2016-11-13 Thread Midas A
How to improve facet query performance


Re: Poor Solr Cloud Query Performance against a Small Dataset

2016-11-03 Thread Dave Seltzer
Good tip Rick,

I'll dig in and make sure everything is set up correctly.

Thanks!

-D

Dave Seltzer 
Chief Systems Architect
TVEyes
(203) 254-3600 x222

On Wed, Nov 2, 2016 at 9:05 PM, Rick Leir  wrote:

> Here is a wild guess. Whenever I see a 5 second delay in networking, I
> think DNS timeouts. YMMV, good luck.
>
> cheers -- Rick
>
> On 2016-11-01 04:18 PM, Dave Seltzer wrote:
>
>> Hello!
>>
>> I'm trying to utilize Solr Cloud to help with a hash search problem. The
>> record set has only 4,300 documents.
>>
>> When I run my search against a single core I get results on the order of
>> 10ms. When I run the same search against Solr Cloud results take about
>> 5,000 ms.
>>
>> Is there something about this particular query which makes it perform
>> poorly in a Cloud environment? The query looks like this (linebreaks added
>> for readability):
>>
>> {!frange+l%3D5+u%3D25}sum(
>>  termfreq(hashTable_0,'225706351'),
>>  termfreq(hashTable_1,'17664000'),
>>  termfreq(hashTable_2,'86447642'),
>>  termfreq(hashTable_3,'134816033'),
>>
>
>


Re: Poor Solr Cloud Query Performance against a Small Dataset

2016-11-02 Thread Rick Leir
Here is a wild guess. Whenever I see a 5 second delay in networking, I 
think DNS timeouts. YMMV, good luck.


cheers -- Rick

On 2016-11-01 04:18 PM, Dave Seltzer wrote:

Hello!

I'm trying to utilize Solr Cloud to help with a hash search problem. The
record set has only 4,300 documents.

When I run my search against a single core I get results on the order of
10ms. When I run the same search against Solr Cloud results take about
5,000 ms.

Is there something about this particular query which makes it perform
poorly in a Cloud environment? The query looks like this (linebreaks added
for readability):

{!frange+l%3D5+u%3D25}sum(
 termfreq(hashTable_0,'225706351'),
 termfreq(hashTable_1,'17664000'),
 termfreq(hashTable_2,'86447642'),
 termfreq(hashTable_3,'134816033'),




Poor Solr Cloud Query Performance against a Small Dataset

2016-11-01 Thread Dave Seltzer
Hello!

I'm trying to utilize Solr Cloud to help with a hash search problem. The
record set has only 4,300 documents.

When I run my search against a single core I get results on the order of
10ms. When I run the same search against Solr Cloud results take about
5,000 ms.

Is there something about this particular query which makes it perform
poorly in a Cloud environment? The query looks like this (linebreaks added
for readability):

{!frange+l%3D5+u%3D25}sum(
termfreq(hashTable_0,'225706351'),
termfreq(hashTable_1,'17664000'),
termfreq(hashTable_2,'86447642'),
termfreq(hashTable_3,'134816033'),
termfreq(hashTable_4,'1061820218'),
termfreq(hashTable_5,'543627850'),
termfreq(hashTable_6,'-1828379348'),
termfreq(hashTable_7,'423236759'),
termfreq(hashTable_8,'522192943'),
termfreq(hashTable_9,'572537937'),
termfreq(hashTable_10,'286991887'),
termfreq(hashTable_11,'789711386'),
termfreq(hashTable_12,'235801909'),
termfreq(hashTable_13,'67109911'),
termfreq(hashTable_14,'609628285'),
termfreq(hashTable_15,'1796472850'),
termfreq(hashTable_16,'202312085'),
termfreq(hashTable_17,'306200840'),
termfreq(hashTable_18,'85657669'),
termfreq(hashTable_19,'671548727'),
termfreq(hashTable_20,'71309060'),
termfreq(hashTable_21,'1125848323'),
termfreq(hashTable_22,'1077548043'),
termfreq(hashTable_23,'117638159'),
termfreq(hashTable_24,'-1408039642'))
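
For reference, a small sketch of how such a query string can be generated from the 25 hash values instead of being written by hand (the sample values here are shortened):

public class HashQueryBuilder {

    // Builds the {!frange l=.. u=..}sum(termfreq(hashTable_i,'hash'), ...) query
    // shown above from an array of per-table hash values.
    static String buildQuery(int[] hashes, int lower, int upper) {
        StringBuilder sb = new StringBuilder()
                .append("{!frange l=").append(lower)
                .append(" u=").append(upper).append("}sum(");
        for (int i = 0; i < hashes.length; i++) {
            if (i > 0) sb.append(',');
            sb.append("termfreq(hashTable_").append(i)
              .append(",'").append(hashes[i]).append("')");
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        int[] hashes = {225706351, 17664000, 86447642};  // normally 25 values
        System.out.println(buildQuery(hashes, 5, 25));
    }
}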

The schema looks like this:

   (field definitions stripped by the list archive)
   subFingerprintId

I've included some sample output below. I wasn't sure if this was a matter
of changing the routing key in the collections system, or if this is a more
fundamental problem with the way Term Frequencies are counted in a Solr
Cloud environment.

Many thanks!

-Dave

-- Single Core Example Query:
{
  "responseHeader":{
"status":0,
"QTime":13,
"params":{
  "q":"{!frange l=5
u=25}sum(termfreq(hashTable_0,'354749018'),termfreq(hashTable_1,'286534657'),termfreq(hashTable_2,'1798007322'),termfreq(hashTable_3,'151854851'),termfreq(hashTable_4,'142869766'),termfreq(hashTable_5,'240584768'),termfreq(hashTable_6,'68120837'),termfreq(hashTable_7,'134945863'),termfreq(hashTable_8,'688067644'),termfreq(hashTable_9,'621220625'),termfreq(hashTable_10,'1732446991'),termfreq(hashTable_11,'505547282'),termfreq(hashTable_12,'135990559'),termfreq(hashTable_13,'123097623'),termfreq(hashTable_14,'454174225'),termfreq(hashTable_15,'788988675'),termfreq(hashTable_16,'53480196'),termfreq(hashTable_17,'487550779'),termfreq(hashTable_18,'455477045'),termfreq(hashTable_19,'1141310997'),termfreq(hashTable_20,'71322652'),termfreq(hashTable_21,'805503533'),termfreq(hashTable_22,'656158000'),termfreq(hashTable_23,'302410303'),termfreq(hashTable_24,'194970957'))",
  "indent":"on",
  "wt":"json",
  "debugQuery":"on",
  "_":"1478024378680"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"subFingerprintId":"f6c9093e-e8e9-4c0f-aa2a-387b46e7ef2a",
"trackId":"5207095a-0126-4c41-8787-16d41165158a",
"sequenceNumber":136,
"sequenceAt":12.5399129172714,
"hashTable_0":354749018,
"hashTable_1":287779841,
"hashTable_2":1797994010,
"hashTable_3":151854851,
"hashTable_4":375260422,
"hashTable_5":441911360,
"hashTable_6":68120837,
"hashTable_7":420158535,
"hashTable_8":16979004,
"hashTable_9":1443304209,
"hashTable_10":1732468239,
"hashTable_11":455215642,
"hashTable_12":135990559,
"hashTable_13":123093271,
"hashTable_14":1444029969,
"hashTable_15":788988675,
"hashTable_16":53480196,
"hashTable_17":488255035,
"hashTable_18":505809973,
"hashTable_19":201814293,
"hashTable_20":70208520,
"hashTable_21":805503541,
"hashTable_22":658713904,
"hashTable_23":302387775,
"hashTable_24":194970957,
"_version_":1549818240561053696}]
  },
  "debug":{
"rawquerystring":"{!frange l=5
u=25}sum(termfreq(hashTable_0,'354749018'),termfreq(hashTable_1,'286534657'),termfreq(hashTable_2,'1798007322'),termfreq(hashTable_3,'151854851'),termfreq(hashTable_4,'142869766'),termfreq(hashTable_5,'240584768'),termfreq(hashTable_6,'68120837'),termfreq(hashTable_7,'134945863'),termfreq(hashTable_8,'688067644'),termfreq(hashTable_9,'621220625'),termfreq(hashTable_10,'1732446991'),termfreq(hashTable_11,'505547282'),termfreq(hashTable_12,'135990559'),termfreq(hashTable_13,'123097623'),termfreq(hashTable_14,'454174225'),termfreq(hashTable_15,'788988675'),termfreq(hashTable_16,'53480196'),termfreq(hashTable_17,'487550779'),termfreq(hashTable_18,'455477045'),termfreq(hashTable_19,'1141310997'),termfreq(hashTable_20,'71322652'),termfreq(hashTable_21,'805503533'),termfreq(hashTable_22,'656158000'),termfreq(hashTable_23,'302410303'),termfreq(hashTable_24,'194970957'))",
"querystring":"{!frange l=5

Multi-core query performance tuning/monitoring

2016-10-13 Thread Oleg Ievtushok
Hi

I have a few filter queries that use cross-core joins to filter
documents. After I inverted those joins they became slower. So it looks
something like this:

I used to query "product" core with query that contains fq={!join to=tags
from=preferred_tags fromIndex=user}(country:US AND
...)&fq=product_category:0&...
Now I query "user" core with query that contains fq={!join
to=preferred_tags from=tags fromIndex=product}(product_category:0 AND
...)&fq=country:US&...

Both tags and preferred_tags might contain multiple values, and the "product"
core is used more often (so its cache could be warmer). The "user" index is
smaller than "product". After a few queries Solr seems to warm up and serves
the query ~50x faster, but the initial queries are extremely slow. I tried
turning off caching for the filter and setting its cost higher than 150, but it
did not help much. I was thinking about adding autowarming queries, but first I
want to understand what makes the join so slow. What would be the right way to
debug it and see which part is the slowest?

Also, if I go with autowarming, since there are 2 cores involved I
wonder which warmup query should be used: "fq={!join to=preferred_tags
from=tags fromIndex=product}(product_category:0 AND ...)" on the "user" core, or
"fq=(product_category:0 AND ...)" on "product"...

Solr version is 4.3.0


Regards, Oleg


Re: Effects of insert order on query performance

2016-08-12 Thread Jeff Wartes
Thanks Emir. I’m unfortunately already using a routing key that needs to be at 
the top level, since I’m collapsing on that field. 

Adding a sub-key won’t help much if my theory is correct, as even a single 
shard (distrib=false) showed serious performance degradation, and query latency 
is the max(shard latency). I’d need a routing scheme that assured that a given 
shard has *only* A’s, or *only* B’s.

Even if I could use “permissions” as the top-level routing key though, this is 
a very low cardinality field, so I’d expect to end up with very large 
differences between the sizes of the shards in that case. That’s fine from a 
SolrCloud query perspective of course, but it makes for more difficult resource 
provisioning.


On 8/12/16, 1:39 AM, "Emir Arnautovic"  wrote:

Hi Jeff,

I will not comment on your theory (will let that to guys more familiar 
with Lucene code) but will point to one alternative solution: routing. 
You can use routing to split documents with different permission to 
different shards and use composite hash routing to split "A" (and maybe 
"B" as well) documents to multiple shards. That will make sure all doc 
with the same permission are on the same shard and on query time only 
those will be queried (less shards to query) and there is no need to 
include term query or filter query at all.

Here is blog explaining benefits of composite hash routing: 
https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/

Regards,
Emir

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On 11.08.2016 19:39, Jeff Wartes wrote:
> This isn’t really a question, although some validation would be nice. 
It’s more of a warning.
>
> Tldr is that the insert order of documents in my collection appears to 
have had a huge effect on my query speed.
>
>
> I have a very large (sharded) SolrCloud 5.4 index. One aspect of this 
index is a multi-valued field (“permissions”) that for 90% of docs contains one 
particular value, (“A”) and for 10% of docs contains another distinct value. 
(“B”) It’s intended to represent something like permissions, so more values are 
possible in the future, but not present currently. In fact, the addition of 
docs with value B to this index was very recent, previously all docs had value 
“A”. All queries, in addition to various other Boolean-query type restrictions, 
have a terms query on this field, like {!terms f=permissions v=A} or {!terms 
f=permissions v=A,B}
>
> Last week, I tried to re-index the whole collection from scratch, using 
source data. Query performance on the resulting re-index proved to be abysmal, 
I could get barely 10% of my previous query throughput, and even that was at 
latencies that were orders of magnitude higher than what I had in production.
>
> I hooked up some CPU profiling to a server that had shards from both the 
old and new version of the collection, and eventually it looked like the 
significant difference in processing the two collections was coming from 
ConstantWeight.scorer()
> Specifically, this line
> 
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/solr/core/src/java/org/apache/solr/search/SolrConstantScoreQuery.java#L102
> was far more expensive in my re-indexed collection. From there, the call 
chain goes through an LRUQueryCache, down to a BulkScorer, and ends up with the 
extra work happening here:
> 
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/lucene/core/src/java/org/apache/lucene/search/Weight.java#L169
>
> I don’t pretend to understand all that code, but the difference in my 
re-index appears to have something to do either with that cache, or the 
aggregate docIdSets that need weights generated is simply much bigger in my 
re-index.
>
>
> But the queries didn’t change, and the data is basically the same, what 
else could have changed?
>
> The documents with the “B” distinct value were added recently to the 
high-performance collection, but the A’s and the B’s were all mixed up in the 
source data dump I used to re-index. On a hunch, I manually ordered the docs 
such that the A’s were all first and re-indexed again, and performance is great!
>
> Here’s my theory: Using TieredMergePolicy, the vast quantity of the 
documents in an index are contained in the largest segments. I’m guessing 
there’s an optimization somewhere that says something like “This segment only 
has A’s”. By indexing all the A’s first, those biggest segments only contain 
A’s, and only the smallest, newest segments are unable to make use of that 
optimization.
>
> Here’s the scary part: Although my re-

Re: Effects of insert order on query performance

2016-08-12 Thread Emir Arnautovic

Hi Jeff,

I will not comment on your theory (I will leave that to people more familiar 
with the Lucene code) but will point to one alternative solution: routing. 
You can use routing to send documents with different permissions to 
different shards, and use composite hash routing to split the "A" (and maybe 
"B" as well) documents across multiple shards. That will make sure all docs 
with the same permission are on the same shard, that at query time only 
those shards are queried (fewer shards to query), and that there is no need to 
include the terms query or a filter query at all.


Here is a blog post explaining the benefits of composite hash routing: 
https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/
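
A rough sketch of what that looks like (recent SolrJ API; the ZooKeeper address, collection and field names are made up). The part of the id before '!' is the routing key, and 'A/2!'-style keys are the composite hash routing described above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RoutingSketch {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181").build()) {
            solr.setDefaultCollection("docs");

            // Index: the "A!" prefix co-locates all permission-A documents.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "A!doc-12345");
            doc.addField("permissions", "A");
            solr.add(doc);

            // Query: _route_ restricts the request to the shard(s) holding "A".
            SolrQuery q = new SolrQuery("body:foo");
            q.set("_route_", "A!");
            solr.query(q);
        }
    }
}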


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On 11.08.2016 19:39, Jeff Wartes wrote:

This isn’t really a question, although some validation would be nice. It’s more 
of a warning.

Tldr is that the insert order of documents in my collection appears to have had 
a huge effect on my query speed.


I have a very large (sharded) SolrCloud 5.4 index. One aspect of this index is 
a multi-valued field (“permissions”) that for 90% of docs contains one 
particular value, (“A”) and for 10% of docs contains another distinct value. 
(“B”) It’s intended to represent something like permissions, so more values are 
possible in the future, but not present currently. In fact, the addition of 
docs with value B to this index was very recent, previously all docs had value 
“A”. All queries, in addition to various other Boolean-query type restrictions, 
have a terms query on this field, like {!terms f=permissions v=A} or {!terms 
f=permissions v=A,B}

Last week, I tried to re-index the whole collection from scratch, using source 
data. Query performance on the resulting re-index proved to be abysmal, I could 
get barely 10% of my previous query throughput, and even that was at latencies 
that were orders of magnitude higher than what I had in production.

I hooked up some CPU profiling to a server that had shards from both the old 
and new version of the collection, and eventually it looked like the 
significant difference in processing the two collections was coming from 
ConstantWeight.scorer()
Specifically, this line
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/solr/core/src/java/org/apache/solr/search/SolrConstantScoreQuery.java#L102
was far more expensive in my re-indexed collection. From there, the call chain 
goes through an LRUQueryCache, down to a BulkScorer, and ends up with the extra 
work happening here:
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/lucene/core/src/java/org/apache/lucene/search/Weight.java#L169

I don’t pretend to understand all that code, but the difference in my re-index 
appears to have something to do either with that cache, or the aggregate 
docIdSets that need weights generated is simply much bigger in my re-index.


But the queries didn’t change, and the data is basically the same, what else 
could have changed?

The documents with the “B” distinct value were added recently to the 
high-performance collection, but the A’s and the B’s were all mixed up in the 
source data dump I used to re-index. On a hunch, I manually ordered the docs 
such that the A’s were all first and re-indexed again, and performance is great!

Here’s my theory: Using TieredMergePolicy, the vast quantity of the documents 
in an index are contained in the largest segments. I’m guessing there’s an 
optimization somewhere that says something like “This segment only has A’s”. By 
indexing all the A’s first, those biggest segments only contain A’s, and only 
the smallest, newest segments are unable to make use of that optimization.

Here’s the scary part: Although my re-index is now performing well, if this 
theory is right, some random insert (or a deliberate optimize) at some random 
point in the future could cascade a segment merge such that the largest 
segment(s) now contain both A’s and B’s, and performance suddenly goes over a 
cliff. I have no way to prevent this possibility except to stop doing inserts.

My current thinking is that I need to pull the terms-query part out of the 
query and do a filter query for it instead. Probably as a post-filter, since 
I’ve had bad luck with very large filter queries and the filter cache. I’d 
tested this originally (when I only had A’s), but found the performance was a 
bit worse than just leaving it in the query. I’ll take a bit worse and 
predictability over a bit better and a time bomb though, if those are my 
choices.


If anyone has any comments refuting or supporting this theory, I’d certainly 
like to hear it. This is the first time I’ve encountered anything about insert 
order mattering from a performance perspective, and it becomes a general-form 
question around how to handle low-cardinality fields.



Effects of insert order on query performance

2016-08-11 Thread Jeff Wartes

This isn’t really a question, although some validation would be nice. It’s more 
of a warning.

Tldr is that the insert order of documents in my collection appears to have had 
a huge effect on my query speed.


I have a very large (sharded) SolrCloud 5.4 index. One aspect of this index is 
a multi-valued field (“permissions”) that for 90% of docs contains one 
particular value, (“A”) and for 10% of docs contains another distinct value. 
(“B”) It’s intended to represent something like permissions, so more values are 
possible in the future, but not present currently. In fact, the addition of 
docs with value B to this index was very recent, previously all docs had value 
“A”. All queries, in addition to various other Boolean-query type restrictions, 
have a terms query on this field, like {!terms f=permissions v=A} or {!terms 
f=permissions v=A,B}

Last week, I tried to re-index the whole collection from scratch, using source 
data. Query performance on the resulting re-index proved to be abysmal, I could 
get barely 10% of my previous query throughput, and even that was at latencies 
that were orders of magnitude higher than what I had in production.

I hooked up some CPU profiling to a server that had shards from both the old 
and new version of the collection, and eventually it looked like the 
significant difference in processing the two collections was coming from 
ConstantWeight.scorer()
Specifically, this line
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/solr/core/src/java/org/apache/solr/search/SolrConstantScoreQuery.java#L102
was far more expensive in my re-indexed collection. From there, the call chain 
goes through an LRUQueryCache, down to a BulkScorer, and ends up with the extra 
work happening here:
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/lucene/core/src/java/org/apache/lucene/search/Weight.java#L169

I don’t pretend to understand all that code, but the difference in my re-index 
appears to have something to do either with that cache, or the aggregate 
docIdSets that need weights generated is simply much bigger in my re-index.


But the queries didn’t change, and the data is basically the same, what else 
could have changed?

The documents with the “B” distinct value were added recently to the 
high-performance collection, but the A’s and the B’s were all mixed up in the 
source data dump I used to re-index. On a hunch, I manually ordered the docs 
such that the A’s were all first and re-indexed again, and performance is great!

Here’s my theory: Using TieredMergePolicy, the vast quantity of the documents 
in an index are contained in the largest segments. I’m guessing there’s an 
optimization somewhere that says something like “This segment only has A’s”. By 
indexing all the A’s first, those biggest segments only contain A’s, and only 
the smallest, newest segments are unable to make use of that optimization.

Here’s the scary part: Although my re-index is now performing well, if this 
theory is right, some random insert (or a deliberate optimize) at some random 
point in the future could cascade a segment merge such that the largest 
segment(s) now contain both A’s and B’s, and performance suddenly goes over a 
cliff. I have no way to prevent this possibility except to stop doing inserts.

My current thinking is that I need to pull the terms-query part out of the 
query and do a filter query for it instead. Probably as a post-filter, since 
I’ve had bad luck with very large filter queries and the filter cache. I’d 
tested this originally (when I only had A’s), but found the performance was a 
bit worse than just leaving it in the query. I’ll take a bit worse and 
predictability over a bit better and a time bomb though, if those are my 
choices.
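
For reference, the filter form described in that last paragraph would look roughly like the sketch below (the main-query clause is made up; whether the filter actually executes as a post-filter depends on the query parser supporting Solr's PostFilter interface):

import org.apache.solr.client.solrj.SolrQuery;

public class PermissionsFilter {
    public static void main(String[] args) {
        // The other Boolean restrictions stay in q; the permissions terms
        // query moves into a non-cached, high-cost filter query.
        SolrQuery q = new SolrQuery("title:something");
        q.addFilterQuery("{!terms f=permissions v=A,B cache=false cost=200}");
    }
}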


If anyone has any comments refuting or supporting this theory, I’d certainly 
like to hear it. This is the first time I’ve encountered anything about insert 
order mattering from a performance perspective, and it becomes a general-form 
question around how to handle low-cardinality fields.



Re: SolrCloud - Query performance degrades with multiple servers(Shards)

2016-07-19 Thread Erick Erickson
15M docs may still comfortably fit in a single shard!
I've seen up to 300M docs fit on a shard. Then
again I've seen 10M docs make things unacceptably
slow.

You simply cannot extrapolate from 10K to
5M reliably. Put all 5M docs on the stand-alone
servers and test _that_. Whenever I see numbers
like 30K qps (assuming this is queries, not number
of docs indexed) I wonder if you're using the
same query over and over and hitting the query
result cache rather than doing any actual
searches.

But to answer your question (again). Sharding adds
overhead. There's no way to make that overhead
magically disappear. What you measure is what
you can expect, and you must measure.

Best,
Erick

On Tue, Jul 19, 2016 at 8:32 AM, Susheel Kumar  wrote:
> You may want to utilise Document routing (_route_) option to have your
> query serve faster but above you are trying to compare apple with oranges
> meaning your performance tests numbers have to be based on either your
> actual numbers like 3-5 million docs per shard or sufficient enough to see
> advantage of using sharding.  10K is nothing for your performance tests and
> will not give you anything.
>
> Otherwise as Eric mentioned don't shard  and add replica's if there is no
> need to distribute/divide data into shards.
>
>
> See
> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>
> https://cwiki.apache.org/confluence/display/solr/Advanced+Distributed+Request+Options
>
>
> Thanks,
> Susheel
>
> On Tue, Jul 19, 2016 at 1:41 AM, kasimjinwala 
> wrote:
>
>> This is just for performance testing we have taken 10K records per shard.
>> In
>> live scenario it would be 30L-50L per shard. I want to search document from
>> all shards, it will slow down and take too long time.
>>
>> I know in case of solr Cloud, it will query all shard node and then return
>> result. Is there any way to search document in all shard with best
>> performance(qps)
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287763.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>


Re: SolrCloud - Query performance degrades with multiple servers(Shards)

2016-07-19 Thread Susheel Kumar
You may want to utilise the document routing (_route_) option to make your
queries faster, but above you are trying to compare apples with oranges:
your performance test numbers have to be based on either your actual
volumes (3-5 million docs per shard) or at least enough data to see the
advantage of sharding.  10K documents is nothing for a performance test and
will not tell you anything.

Otherwise, as Erick mentioned, don't shard; add replicas if there is no
need to distribute/divide the data across shards.


See
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud

https://cwiki.apache.org/confluence/display/solr/Advanced+Distributed+Request+Options


Thanks,
Susheel

On Tue, Jul 19, 2016 at 1:41 AM, kasimjinwala 
wrote:

> This is just for performance testing we have taken 10K records per shard.
> In
> live scenario it would be 30L-50L per shard. I want to search document from
> all shards, it will slow down and take too long time.
>
> I know in case of solr Cloud, it will query all shard node and then return
> result. Is there any way to search document in all shard with best
> performance(qps)
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287763.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SolrCloud - Query performance degrades with multiple servers(Shards)

2016-07-19 Thread kasimjinwala
This is just for performance testing; we have taken 10K records per shard. In a
live scenario it would be 30L-50L (3-5 million) per shard. When I search for
documents across all shards, it slows down and takes too long.

I know that in SolrCloud the query goes to every shard node and the results
are then merged. Is there any way to search documents across all shards with
the best performance (qps)?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287763.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud - Query performance degrades with multiple servers(Shards)

2016-07-18 Thread Erick Erickson
+1 to Susheel's question. Sharding inevitably adds
overhead. Roughly each shard is queried
for its top N docs (10 if, say, rows=10). The
doc ID and sort criteria (score by default) are returned
to the node that originally got the request. That node
then sorts the lists into the real top 10 to return to
the user. Then the node handling the request re-queries
the shards for the contents of those docs.

Sharding is a way to handle very large data sets, the
general recommendation is to shard _only_ when you
have too many documents to get good query perf
from a single shard.

If you need to increase QPS, add _replicas_ not shards.
Only go to sharding when you have too many documents
to fit on your hardware.

Best,
Erick

On Mon, Jul 18, 2016 at 6:31 AM, Susheel Kumar  wrote:
> Hello,
>
> Question:  Do you really need sharding/can live without sharding since you
> mentioned only 10K records in one shard. What's your index/document size?
>
> Thanks,
> Susheel
>
> On Mon, Jul 18, 2016 at 2:08 AM, kasimjinwala 
> wrote:
>
>> currently I am using solrCloud 5.0 and I am facing query performance issue
>> while using 3 implicit shards, each shard contain around 10K records.
>> when I am specifying shards parameter(*shards=shard1*) in query it gives
>> 30K-35K qps. but while removing shards parameter from query it give
>> *1000-1500qps*. performance decreases drastically.
>>
>> please provide comment or suggestion to solve above issue
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287600.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>


Re: SolrCloud - Query performance degrades with multiple servers(Shards)

2016-07-18 Thread Susheel Kumar
Hello,

Question:  Do you really need sharding, or can you live without it, since you
mentioned only 10K records in one shard? What's your index/document size?

Thanks,
Susheel

On Mon, Jul 18, 2016 at 2:08 AM, kasimjinwala 
wrote:

> currently I am using solrCloud 5.0 and I am facing query performance issue
> while using 3 implicit shards, each shard contain around 10K records.
> when I am specifying shards parameter(*shards=shard1*) in query it gives
> 30K-35K qps. but while removing shards parameter from query it give
> *1000-1500qps*. performance decreases drastically.
>
> please provide comment or suggestion to solve above issue
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287600.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SolrCloud - Query performance degrades with multiple servers(Shards)

2016-07-18 Thread kasimjinwala
Currently I am using SolrCloud 5.0 and I am facing a query performance issue
while using 3 implicit shards; each shard contains around 10K records.
When I specify the shards parameter (*shards=shard1*) in the query it gives
30K-35K qps, but when I remove the shards parameter from the query it gives
*1000-1500 qps*. Performance decreases drastically.

please provide comment or suggestion to solve above issue



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Query-performance-degrades-with-multiple-servers-tp4024660p4287600.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: measuring query performance & qps per node

2016-04-27 Thread Erick Erickson
In SolrCloud you can collect stats on pivot facets, see:
https://issues.apache.org/jira/browse/SOLR-6351

There are more buckets to count into and in SolrCloud you
have extra work to reconcile the partial results from
different shards.

Best,
Erick

On Mon, Apr 25, 2016 at 8:50 PM, Jay Potharaju  wrote:
> Thanks for the response Erick. I knew that it would depend on the number of
> factors like you mentioned.I just wanted to know whether a  good
> combination of queries, facets & filters should be a good estimate of how
> solr might behave.
>
> what did you mean by "Add stats to pivots in Cloud mode."
>
> Thanks
>
> On Mon, Apr 25, 2016 at 5:05 PM, Erick Erickson 
> wrote:
>
>>  Impossible to answer. For instance, a facet query can be very
>> heavy-duty. Add stats
>> to pivots in Cloud mode.
>>
>> As for using a bunch of fq clauses, It Depends (tm). If your expected usage
>> pattern is all queries like 'q=*:*&fq=clause1&fq=clause2" then it's
>> fine. It totally
>> falls down if, for instance, you have a bunch of facets. Or grouping.
>> Or.
>>
>> Best,
>> Erick
>>
>> On Mon, Apr 25, 2016 at 3:48 PM, Jay Potharaju 
>> wrote:
>> > Hi,
>> > I am trying to measure how will are queries performing ie how long are
>> they
>> > taking. In order to measure query speed I am using solrmeter with 50k
>> > unique filter queries. And then checking if any of the queries are slower
>> > than 50ms. Is this a good approach to measure query performance?
>> >
>> > Are there any guidelines on how to measure if a given instance can
>> handle a
>> > given number of qps(query per sec)? For example if my doc size is 30
>> > million docs and index size is 40 GB of data and the RAM on the instance
>> is
>> > 60 GB, then how many qps can it handle? Or is this a hard question to
>> > answer and it depends on the load and type of query running at a given
>> time.
>> >
>> > --
>> > Thanks
>> > Jay
>>
>
>
>
> --
> Thanks
> Jay Potharaju


Re: measuring query performance & qps per node

2016-04-25 Thread Jay Potharaju
Thanks for the response Erick. I knew that it would depend on a number of
factors like you mentioned. I just wanted to know whether a good
combination of queries, facets & filters would give a good estimate of how
Solr might behave.

what did you mean by "Add stats to pivots in Cloud mode."

Thanks

On Mon, Apr 25, 2016 at 5:05 PM, Erick Erickson 
wrote:

>  Impossible to answer. For instance, a facet query can be very
> heavy-duty. Add stats
> to pivots in Cloud mode.
>
> As for using a bunch of fq clauses, It Depends (tm). If your expected usage
> pattern is all queries like 'q=*:*&fq=clause1&fq=clause2" then it's
> fine. It totally
> falls down if, for instance, you have a bunch of facets. Or grouping.
> Or.
>
> Best,
> Erick
>
> On Mon, Apr 25, 2016 at 3:48 PM, Jay Potharaju 
> wrote:
> > Hi,
> > I am trying to measure how will are queries performing ie how long are
> they
> > taking. In order to measure query speed I am using solrmeter with 50k
> > unique filter queries. And then checking if any of the queries are slower
> > than 50ms. Is this a good approach to measure query performance?
> >
> > Are there any guidelines on how to measure if a given instance can
> handle a
> > given number of qps(query per sec)? For example if my doc size is 30
> > million docs and index size is 40 GB of data and the RAM on the instance
> is
> > 60 GB, then how many qps can it handle? Or is this a hard question to
> > answer and it depends on the load and type of query running at a given
> time.
> >
> > --
> > Thanks
> > Jay
>



-- 
Thanks
Jay Potharaju


Re: measuring query performance & qps per node

2016-04-25 Thread Erick Erickson
 Impossible to answer. For instance, a facet query can be very
heavy-duty. Add stats
to pivots in Cloud mode.

As for using a bunch of fq clauses, It Depends (tm). If your expected usage
pattern is all queries like 'q=*:*&fq=clause1&fq=clause2" then it's
fine. It totally
falls down if, for instance, you have a bunch of facets. Or grouping. Or.

Best,
Erick

On Mon, Apr 25, 2016 at 3:48 PM, Jay Potharaju  wrote:
> Hi,
> I am trying to measure how will are queries performing ie how long are they
> taking. In order to measure query speed I am using solrmeter with 50k
> unique filter queries. And then checking if any of the queries are slower
> than 50ms. Is this a good approach to measure query performance?
>
> Are there any guidelines on how to measure if a given instance can handle a
> given number of qps(query per sec)? For example if my doc size is 30
> million docs and index size is 40 GB of data and the RAM on the instance is
> 60 GB, then how many qps can it handle? Or is this a hard question to
> answer and it depends on the load and type of query running at a given time.
>
> --
> Thanks
> Jay


measuring query performance & qps per node

2016-04-25 Thread Jay Potharaju
Hi,
I am trying to measure how well our queries are performing, i.e. how long they
are taking. In order to measure query speed I am using SolrMeter with 50k
unique filter queries, and then checking whether any of the queries are slower
than 50ms. Is this a good approach to measuring query performance?

Are there any guidelines on how to determine whether a given instance can
handle a given number of qps (queries per sec)? For example, if I have 30
million docs, the index size is 40 GB, and the RAM on the instance is 60 GB,
how many qps can it handle? Or is this a hard question to answer because it
depends on the load and the type of query running at a given time?

-- 
Thanks
Jay


Re: normal solr query vs facet query performance

2016-04-18 Thread Shawn Heisey
On 4/18/2016 5:06 AM, Mugeesh Husain wrote:
> 1.)solr normal query(q=*:*) vs facet query(facet.query="abc") ?
> 2.)solr normal query(q=*:*) vs facet
> search(facet=true&facet.field=column_name) ?
> 3.)solr filter query(q=Column:some value) vs facet query(facet.query="abc")
> ?
> 4.)solr normal query(q=*:*) vs filter query(q=column:some value) ?

This is a question that is nearly impossible to answer without your
actual index, and even then only you can answer it.  You need to *try*
these queries and see what happens.

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Note that there is a performance bug with the *:* (MatchAllDocuments)
query on 5.x versions, which is only solved in 5.5.0 and later.  This
query runs quite a bit slower than it should.

https://issues.apache.org/jira/browse/SOLR-8251

Thanks,
Shawn



normal solr query vs facet query performance

2016-04-18 Thread Mugeesh Husain
Hello,

I am trying to find out which query will be faster in terms of performance:

1.)solr normal query(q=*:*) vs facet query(facet.query="abc") ?
2.)solr normal query(q=*:*) vs facet
search(facet=true&facet.field=column_name) ?
3.)solr filter query(q=Column:some value) vs facet query(facet.query="abc")
?
4.)solr normal query(q=*:*) vs filter query(q=column:some value) ?



Could you also point me to some good tutorials covering these topics?


Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/normal-solr-query-vs-facet-query-performance-tp4270907.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Soft commit does not affecting query performance

2016-04-13 Thread Bhaumik Joshi
Hi Bill,


Please find below reference.

http://www.cloudera.com/documentation/enterprise/5-4-x/topics/search_tuning_solr.html
* "Enable soft commits and set the value to the largest value that 
meets your requirements. The default value of 1000 (1 second) is too aggressive 
for some environments."


Thanks & Regards,

Bhaumik Joshi



From: billnb...@gmail.com 
Sent: Monday, April 11, 2016 7:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Soft commit does not affecting query performance

Why do you think it would ?

Bill Bell
Sent from mobile


> On Apr 11, 2016, at 7:48 AM, Bhaumik Joshi  wrote:
>
> Hi All,
>
> We are doing query performance test with different soft commit intervals. In 
> the test with 1sec of soft commit interval and 1min of soft commit interval 
> we didn't notice any improvement in query timings.
>
>
>
> We did test with SolrMeter (Standalone java tool for stress tests with Solr) 
> for 1sec soft commit and 1min soft commit.
>
> Index stats of test solr cloud: 0.7 million documents and 1 GB index size.
>
> Solr cloud has 2 shard and each shard has one replica.
>
>
>
> Please find below detailed test readings: (all timings are in milliseconds)
>
>
> Soft commit - 1sec
> Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q time  Total Client time  Avg Client time
> 1            5            100            44340         443         48834              488
> 5            5            101            128914        1276        143239             1418
> 10           5            104            295325        2839        330931             3182
> 25           5            102            675319        6620        793874             7783
>
> Soft commit - 1min
> Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q time  Total Client time  Avg Client time
> 1            5            100            44292         442         48569              485
> 5            5            105            131389        1251        147174             1401
> 10           5            102            299518        2936        337748             3311
> 25           5            108            742639        6876        865222             8011
>
> As theory suggests soft commit affects query performance but in my case it 
> doesn't. Can you put some light on this?
> Also suggest if I am missing something here.
>
> Regards,
> Bhaumik Joshi
>
>
>
>
>
>
>


Re: Soft commit does not affecting query performance

2016-04-11 Thread billnbell
Why do you think it would ?

Bill Bell
Sent from mobile


> On Apr 11, 2016, at 7:48 AM, Bhaumik Joshi  wrote:
> 
> Hi All,
> 
> We are doing query performance test with different soft commit intervals. In 
> the test with 1sec of soft commit interval and 1min of soft commit interval 
> we didn't notice any improvement in query timings.
> 
> 
> 
> We did test with SolrMeter (Standalone java tool for stress tests with Solr) 
> for 1sec soft commit and 1min soft commit.
> 
> Index stats of test solr cloud: 0.7 million documents and 1 GB index size.
> 
> Solr cloud has 2 shard and each shard has one replica.
> 
> 
> 
> Please find below detailed test readings: (all timings are in milliseconds)
> 
> 
> Soft commit - 1sec
> Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q time  Total Client time  Avg Client time
> 1            5            100            44340         443         48834              488
> 5            5            101            128914        1276        143239             1418
> 10           5            104            295325        2839        330931             3182
> 25           5            102            675319        6620        793874             7783
>
> Soft commit - 1min
> Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q time  Total Client time  Avg Client time
> 1            5            100            44292         442         48569              485
> 5            5            105            131389        1251        147174             1401
> 10           5            102            299518        2936        337748             3311
> 25           5            108            742639        6876        865222             8011
> 
> As theory suggests soft commit affects query performance but in my case it 
> doesn't. Can you put some light on this?
> Also suggest if I am missing something here.
> 
> Regards,
> Bhaumik Joshi
> 


Soft commit does not affecting query performance

2016-04-11 Thread Bhaumik Joshi
Hi All,

We are doing query performance test with different soft commit intervals. In 
the test with 1sec of soft commit interval and 1min of soft commit interval we 
didn't notice any improvement in query timings.



We did test with SolrMeter (Standalone java tool for stress tests with Solr) 
for 1sec soft commit and 1min soft commit.

Index stats of test solr cloud: 0.7 million documents and 1 GB index size.

The Solr cloud has 2 shards and each shard has one replica.



Please find below detailed test readings: (all timings are in milliseconds)


Soft commit - 1sec

Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q time  Total Client time  Avg Client time
1            5            100            44340         443         48834              488
5            5            101            128914        1276        143239             1418
10           5            104            295325        2839        330931             3182
25           5            102            675319        6620        793874             7783

Soft commit - 1min

Queries/sec  Updates/sec  Total Queries  Total Q time  Avg Q time  Total Client time  Avg Client time
1            5            100            44292         442         48569              485
5            5            105            131389        1251        147174             1401
10           5            102            299518        2936        337748             3311
25           5            108            742639        6876        865222             8011

As theory suggests, soft commit should affect query performance, but in my case it
doesn't. Can you shed some light on this?
Also let me know if I am missing something here.

Regards,
Bhaumik Joshi
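
For illustration, a minimal SolrJ sketch of this kind of comparison; the URL, core name and test document are placeholders rather than details from this thread. It issues an explicit soft and hard commit and times the same query after each:

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitTimingSketch {
        public static void main(String[] args) throws SolrServerException, IOException {
            // Placeholder URL/core; point this at the collection under test.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "timing-test-1");
            server.add(doc);

            // Soft commit: opens a new searcher without flushing segments to disk.
            server.commit(false, true, true);
            timeQuery(server, "after soft commit");

            // Hard commit: flushes segments and opens a new searcher.
            server.commit(true, true, false);
            timeQuery(server, "after hard commit");
        }

        private static void timeQuery(HttpSolrServer server, String label)
                throws SolrServerException {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1);
            QueryResponse rsp = server.query(q);
            // QTime is the Solr-side time; getElapsedTime includes client and network overhead.
            System.out.println(label + ": QTime=" + rsp.getQTime()
                    + " ms, elapsed=" + rsp.getElapsedTime() + " ms");
        }
    }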



Re: Is it a good query performance with this data size ?

2015-08-19 Thread wwang525
Hi Upayavira,

I happened to compose individual fq for each field, such as:
fq=Gatewaycode:(...)&fq=DestCode:(...)&fq=DateDep:(...)&fq=Duration:(...)

It is nice to know that I am not creating unnecessary cache entries, since
the above method results in minimal cardinality as you pointed out.

Thanks





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223988.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-19 Thread Upayavira
Yes, you can limit the size of the filter cache, as Erick says, but
then, you could just end up with cache churn, where you are constantly
re-populating your cache as stuff gets pushed out, only to have to
regenerate it again for the next query.

Is it possible to decompose these queries into parts?

fq=+category:sport +year:2015

could be better expressed as:
fq=category:sport
fq=year:2015

Instead of resulting in cardinality(category) * cardinality(year) cache
entries, you'd have cardinality(category) + cardinality(year).

cardinality() here simply means the number of unique values for that
field.

Upayavira
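
For illustration, the decomposed form above expressed through SolrJ; the URL and core name are placeholders, and the field names are just the category/year example:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class DecomposedFilterSketch {
        public static void main(String[] args) throws SolrServerException {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("*:*");
            // Two independent fq clauses -> two filterCache entries, each reusable
            // on its own, instead of one combined entry per category/year pair.
            q.addFilterQuery("category:sport");
            q.addFilterQuery("year:2015");

            System.out.println("hits: " + server.query(q).getResults().getNumFound());
        }
    }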

On Wed, Aug 19, 2015, at 05:23 PM, Erick Erickson wrote:
> bq:  can I limit the size of the three
> caches so that the RAM usage will be under control
> 
> That's exactly what the "size" parameter is for.
> 
> As Upayavira says, the rough size of each entry in
> the filterCache is maxDocs/8 + (sizeof query string).
> 
> queryResultCache is much smaller per entry, it's
> roughly (sizeof entire query) + ((sizeof Java int) *
> <queryResultWindowSize>)
>
> <queryResultWindowSize> is from solrconfig.xml. The point
> here is this is rarely very big unless you make the
> queryResultCache huge.
> 
> As for documentResultCache, it's also usually not
> very large, it's the (size you declare it) * (average size of a doc).
> 
> Best,
> Erick
> 
> On Wed, Aug 19, 2015 at 9:12 AM, wwang525  wrote:
> > Hi Upayavira,
> >
> > Thank you very much for pointing out the potential design issue
> >
> > The queries will be determined through a configuration by business users.
> > There will be a limited number of queries every day, and they will get executed by
> > customers repeatedly. However, business users will change the configurations
> > so that new queries get generated, and those will also be limited. The change
> > can be as frequent as daily or weekly. The project is to support daily
> > promotions based on fresh index data.
> >
> > Cumulatively, there can be a lot of different queries. If I still want to
> > take advantage of the filterCache, can I limit the size of the three
> > caches so that the RAM usage will be under control?
> >
> > Thanks
> >
> >
> >
> > --
> > View this message in context: 
> > http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223960.html
> > Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-19 Thread Erick Erickson
bq:  can I limit the size of the three
caches so that the RAM usage will be under control

That's exactly what the "size" parameter is for.

As Upayavira says, the rough size of each entry in
the filterCache is maxDocs/8 + (sizeof query string).

queryResultCache is much smaller per entry, it's
roughly (sizeof entire query) + ((sizeof Java int) * <queryResultWindowSize>)

<queryResultWindowSize> is from solrconfig.xml. The point
here is this is rarely very big unless you make the
queryResultCache huge.

As for documentResultCache, it's also usually not
very large, it's the (size you declare it) * (average size of a doc).

Best,
Erick
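
For illustration, the filterCache part of that arithmetic as a tiny Java calculation; the maxDocs and cache size numbers below are made-up examples, not figures from this thread:

    public class FilterCacheEstimate {
        public static void main(String[] args) {
            long maxDocs = 15000000L;  // example document count
            int cacheSize = 4096;      // example filterCache "size" setting

            // Each filterCache entry is roughly a bitset over all documents
            // (maxDocs / 8 bytes) plus the comparatively tiny query string.
            long bytesPerEntry = maxDocs / 8;
            long worstCaseBytes = bytesPerEntry * cacheSize;

            System.out.printf("~%.1f MB per entry, ~%.1f GB if the cache fills up%n",
                    bytesPerEntry / (1024.0 * 1024.0),
                    worstCaseBytes / (1024.0 * 1024.0 * 1024.0));
        }
    }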

On Wed, Aug 19, 2015 at 9:12 AM, wwang525  wrote:
> Hi Upayavira,
>
> Thank you very much for pointing out the potential design issue
>
> The queries will be determined through a configuration by business users.
> There will be a limited number of queries every day, and they will get executed by
> customers repeatedly. However, business users will change the configurations
> so that new queries get generated, and those will also be limited. The change
> can be as frequent as daily or weekly. The project is to support daily
> promotions based on fresh index data.
>
> Cumulatively, there can be a lot of different queries. If I still want to
> take advantage of the filterCache, can I limit the size of the three
> caches so that the RAM usage will be under control?
>
> Thanks
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223960.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-19 Thread wwang525
Hi Upayavira,

Thank you very much for pointing out the potential design issue

The queries will be determined through a configuration by business users.
There will be a limited number of queries every day, and they will get executed by
customers repeatedly. However, business users will change the configurations
so that new queries get generated, and those will also be limited. The change
can be as frequent as daily or weekly. The project is to support daily
promotions based on fresh index data.

Cumulatively, there can be a lot of different queries. If I still want to
take advantage of the filterCache, can I limit the size of the three
caches so that the RAM usage will be under control?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223960.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-19 Thread Upayavira
You say "all of my queries are based upon fq"? Why? How unique are they?
Remember, for each fq value, it could end up storing one bit per
document in your index. If you have 8m documents, you could end up with
a cache usage of 1Mb, for that query alone!

Filter queries are primarily designed for queries that are repeated,
e.g.: category:sport, where caching gives some advantage.

If all of your queries are unique, then move them to the q= parameter,
or make them fq={!cache=false}, otherwise you will waste memory storing
cached values that are never used, and CPU building and then destroying
those cached entries.

Upayavira
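
For illustration, both alternatives as a SolrJ sketch; the URL, core name, field and value are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class UncachedFilterSketch {
        public static void main(String[] args) throws SolrServerException {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            // Option 1: put the one-off restriction in q, so no filterCache entry is created.
            SolrQuery q1 = new SolrQuery("sessionid:abc123");

            // Option 2: keep it as an fq but opt out of caching with a local param.
            SolrQuery q2 = new SolrQuery("*:*");
            q2.addFilterQuery("{!cache=false}sessionid:abc123");

            System.out.println(server.query(q1).getResults().getNumFound());
            System.out.println(server.query(q2).getResults().getNumFound());
        }
    }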

On Wed, Aug 19, 2015, at 02:25 PM, wwang525 wrote:
> Hi Erick,
> 
> All my queries are based on fq (filter query). I have to send the
> randomly
> generated queries to warm up low level lucene cache.
> 
> I went to the more tedious way to warm up low level cache without
> utilizing
> the three caches by turning off the three caches (set values to zero).
> Then,
> I sent 800 randomly generated requests to Solr. The RAM jumped from 500MB
> to
> 2.5G, and stayed there.
> 
> Then, I test individual queries against Solr. This time, I got very close
> response time when I requested the first time, second time, or third
> time. 
> 
> The results: 
> 
> (1) average response time: 803 ms with only one request having a response
> time >1 second (1042 ms)
> (2) the majority of the time was spent on query, and not on faceting 
> (730/803 = 90%)
> 
> So the query is the bottleneck.
> 
> I also have an interesting finding: it looks like the fq query works
> better
> with integer type. I created string type for two properties: DateDep and
> Duration since the definition of docValues=true for integer type did not
> work with faceted search. There was a time I accidentally used filter
> query
> with the string type property and I found the query performance degraded
> quite a lot.
> 
> Is it generally true that fq works better with integer type  ?
> 
> If this is the case, I could create two integer type properties for two
> other fq to check if I can boost the performance.
> 
> Thanks
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223920.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-19 Thread wwang525
Hi Erick,

All my queries are based on fq (filter query). I have to send the randomly
generated queries to warm up low level lucene cache.

I went to the more tedious way to warm up low level cache without utilizing
the three caches by turning off the three caches (set values to zero). Then,
I sent 800 randomly generated requests to Solr. The RAM jumped from 500MB to
2.5G, and stayed there.

Then, I test individual queries against Solr. This time, I got very close
response time when I requested the first time, second time, or third time. 

The results: 

(1) average response time: 803 ms with only one request having a response
time >1 second (1042 ms)
(2) the majority of the time was spent on query, and not on faceting 
(730/803 = 90%)

So the query is the bottleneck.

I also have an interesting finding: it looks like the fq query works better
with integer type. I created string type for two properties: DateDep and
Duration since the definition of docValues=true for integer type did not
work with faceted search. There was a time I accidentally used filter query
with the string type property and I found the query performance degraded
quite a lot.

Is it generally true that fq works better with integer type  ?

If this is the case, I could create two integer type properties for two
other fq to check if I can boost the performance.

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223920.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-18 Thread Erick Erickson
bq: can I turn off the three caches and send a lot of queries to Solr

I really think you're missing the easiest way to do that.
To not put anything in the filter cache, just don't send any fq clauses.

As far as the doc cache is concerned, by and large I just wouldn't
worry about it. With MMapDirectory, it's less valuable than it was
when it was created. Its primary usage is that the components in a
single query don't have to re-read the docs from disk. As far as
the queryResultCache, by not putting fq clauses on the warmup
queries you won't hit this cache next time around.

Best,
Erick
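
For illustration, a small SolrJ sketch of such a warm-up pass: everything goes into q= with no fq clauses, so the low-level Lucene/OS caches get exercised without putting anything in the filterCache. The field names are borrowed from earlier in this thread; the value pools and URL are invented:

    import java.util.Random;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class WarmupSketch {
        public static void main(String[] args) throws SolrServerException {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
            Random rnd = new Random();

            // Invented value pools; a real test would mirror production terms.
            String[] gateways = {"YYZ", "YUL", "YVR"};
            String[] destinations = {"CUN", "PUJ", "MBJ"};

            for (int i = 0; i < 800; i++) {
                // Plain q= query with no fq, so nothing lands in the filterCache.
                String q = "Gatewaycode:" + gateways[rnd.nextInt(gateways.length)]
                         + " AND DestCode:" + destinations[rnd.nextInt(destinations.length)];
                server.query(new SolrQuery(q));
            }
        }
    }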

On Tue, Aug 18, 2015 at 1:17 PM, wwang525  wrote:
> Hi Erick,
>
> I just tested 10 different queries with or without the faceting search on
> the two properties : departure_date, and hotel_code. Under cold cache
> scenario, they have pretty much the same response time, and the faceting
> took much less time than the query time. Under cold cache scenario, the
> "query" (under timing)  is still the "bottleneck".
>
> I understand that the low level cache needs to be warmed up to do a more
> realistic test. However, I do not have a good and consistent way to warm up
> the low level cache without caching the filter queries at the same time. If
> I load test some random queries before I test these 10 individual queries, I
> can see a better response time in some cases, but that could also be due to
> filter query cache.
>
> To load up low level lucene cache without creating filtercache/document
> cache etc, can I turn off the three caches and send a lot of queries to Solr
> before I start to test the performance of each individual query?
>
> Thanks
>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223758.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-18 Thread wwang525
Hi Erick,

I just tested 10 different queries with or without the faceting search on
the two properties : departure_date, and hotel_code. Under cold cache
scenario, they have pretty much the same response time, and the faceting
took much less time than the query time. Under cold cache scenario, the
"query" (under timing)  is still the "bottleneck".

I understand that the low level cache needs to be warmed up to do a more
realistic test. However, I do not have a good and consistent way to warm up
the low level cache without caching the filter queries at the same time. If
I load test some random queries before I test these 10 individual queries, I
can see a better response time in some cases, but that could also be due to
filter query cache.

To load up low level lucene cache without creating filtercache/document
cache etc, can I turn off the three caches and send a lot of queries to Solr
before I start to test the performance of each individual query?

Thanks






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223758.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-18 Thread Erick Erickson
Those are not that high. I was thinking of facets with thousands to
tens-of-thousands of unique values. I really wouldn't expect this to
be a huge hit unless you're querying all docs.

Let us know what you find.

Best,
Erick

On Tue, Aug 18, 2015 at 11:31 AM, wwang525  wrote:
> Hi Erick,
>
> Two facets are probably demanding:
>
> departure_date have 365 distinct values and hotel_code can have 800 distinct
> values.
>
> The docValues setting definitely helped me a lot even when all the queries
> had the above two facets. I will test a list of queries with or without the
> two facets after indexing the data (to take advantage of cache warming).
>
> Thanks
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223744.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-18 Thread wwang525
Hi Erick,

Two facets are probably demanding:

departure_date have 365 distinct values and hotel_code can have 800 distinct
values.

The docValues setting definitely helped me a lot even when all the queries
had the above two facets. I will test a list of queries with or without the
two facets after indexing the data (to take advantage of cache warming).

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699p4223744.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is it a good query performance with this data size ?

2015-08-18 Thread Erick Erickson
> ...I usually get an average response time around 1
> second.
>
> If I execute a load test again, the average response time will continue to
> drop. However, it stays at about 500 ms per request under this load if I try
> more tests.
>
> These are the best results so far.
>
> I understand that the requests were all different, so this cannot be compared
> with the case where I execute the same query twice (which usually gives me a
> response time around 150 ms).
>
> In a production environment, many requests may be very similar, so the
> filter queries will be executed faster. However, these tests generate all
> random requests, which is different from the production environment.
>
> In addition, the feature of "warming up cache" may not be applicable to my
> test scenarios due to randomly generated requests for all tests.
>
> I tried to use other search solutions, and the performance was not good.
> That was why I tried to use Solr. Now that I am using Solr, I would like to
> know, in a typical Solr project:
>
> (1) if it is a good response time for this data size without taking too much
> advantage of cache?
> (2) if it is possible to improve even further without data sharding? For
> example, to get an average of  less than 200 ms response time
>
> Additional information to share:
> (1) The tests were done when the Solr instance was not indexing. CPU was
> dedicated to the test and RAM was enough.
>
> (2) most of the settings in solrconfig.xml are defaults. However, the cache
> settings were modified.
> Note, I think the autowarmCount setting may not be very beneficial to my
> tests due to randomly generated requests. However, I still got >50% hit
> ratio for filter queries. This is due to the limited values for some filter
> queries.
>
>    <filterCache class="solr.FastLRUCache"
>                 size="4096"
>                 initialSize="1024"
>                 autowarmCount="32"/>
>
>    <queryResultCache class="solr.LRUCache"
>                 size="512"
>                 initialSize="512"
>                 autowarmCount="32"/>
>
>    <documentCache class="solr.LRUCache"
>                 size="1"
>                 initialSize="256"
>                 autowarmCount="0"/>
>
>
> Thanks
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Is it a good query performance with this data size ?

2015-08-18 Thread wwang525
Hi All,

I am working on a search service based on Solr (v5.1.0). The data size is 15
million records. The size of the index files is 860MB. The test was performed on a
local machine that has 8 cores with 32 GB memory and a 3.4 GHz CPU (Intel
Core i7-3770).

I found out that setting docValues=true for faceting and grouping indeed
boosted the performance with first-time search under cold cache scenario.
For example, with our requests that use all the features like grouping,
sorting, faceting, I found the difference of faceting alone can be as much
as 300 ms. 
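
For illustration, a request of that shape (grouping, sorting, faceting) as a SolrJ sketch; the URL, core name and the price sort field are placeholders, while departure_date and hotel_code are the facet fields mentioned later in this thread:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetGroupSortSketch {
        public static void main(String[] args) throws SolrServerException {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("*:*");
            q.set("sort", "price asc");                      // placeholder sort field
            q.setFacet(true);
            q.addFacetField("departure_date", "hotel_code"); // the two demanding facet fields
            q.set("group", true);
            q.set("group.field", "hotel_code");
            q.setRows(20);

            QueryResponse rsp = server.query(q);
            System.out.println("QTime=" + rsp.getQTime() + " ms");
        }
    }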

However, response time for the same request executed the second time seems
to be at the same level whether the setting of docValues is true or false.
Still, I set up docValues=true for all the faceting properties.

The following are what I have observed:

(1) Test single request one-by-one (no load)

With a cold cache, I execute randomly generated queries one after another.
The first query routinely exceeds 1 second, but not usually more than 2
seconds. As I continue to generate random requests and execute the queries
one-by-one, the response time normally stabilizes in the range of 500 ms. It
does not seem to improve further as I continue to execute randomly generated
queries.

(2) Load test with randomly generated requests

Under a load test scenario (each core takes 4 requests per second, and this
continues for 20 rounds), I can see the CPU usage jump, and the earlier
requests usually get much longer response times; they may even exceed 5
seconds. However, the CPU usage pattern then changes to a sawtooth shape,
the response time drops, and I can see that the requests get
executed faster and faster. I usually get an average response time around 1
second.

If I execute a load test again, the average response time will continue to
drop. However, it stays at about 500 ms per request under this load if I try
more tests.

These are the best results so far. 

I understand that the requests were all different, so this cannot be compared
with the case where I execute the same query twice (which usually gives me a
response time around 150 ms).

In a production environment, many requests may be very similar, so the
filter queries will be executed faster. However, these tests generate all
random requests, which is different from the production environment.

In addition, the feature of "warming up cache" may not be applicable to my
test scenarios due to randomly generated requests for all tests. 

I tried to use other search solutions, and the performance was not good.
That was why I tried to use Solr. Now that I am using Solr, I would like to
know, in a typical Solr project:

(1) if it is a good response time for this data size without taking too much
advantage of cache? 
(2) if it is possible to improve even further without data sharding? For
example, to get an average of  less than 200 ms response time

Additional information to share:
(1) The tests were done when the Solr instance was not indexing. CPU was
dedicated to the test and RAM was enough.

(2) most of the settings in solrconfig.xml are defaults. However, the cache
settings were modified.
Note, I think the autowarmCount setting may not be very beneficial to my
tests due to randomly generated requests. However, I still got >50% hit
ratio for filter queries. This is due to the limited values for some filter
queries.

   <filterCache class="solr.FastLRUCache"
                size="4096"
                initialSize="1024"
                autowarmCount="32"/>

   <queryResultCache class="solr.LRUCache"
                size="512"
                initialSize="512"
                autowarmCount="32"/>

   <documentCache class="solr.LRUCache"
                size="1"
                initialSize="256"
                autowarmCount="0"/>

Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query Performance

2015-07-21 Thread Nagasharath
I tried using SolrMeter but for some reason it does not detect my URL and 
throws a Solr server exception.

Sent from my iPhone

> On 21-Jul-2015, at 10:58 am, Alessandro Benedetti 
>  wrote:
> 
> SolrMeter mate,
> 
> http://code.google.com/p/solrmeter/
> 
> Take a look, it will help you a lot !
> 
> Cheers
> 
> 2015-07-21 16:49 GMT+01:00 Nagasharath :
> 
>> Any recommended tool to test the query performance would be of great help.
>> 
>> Thanks
> 
> 
> 
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England


Re: Query Performance

2015-07-21 Thread Alessandro Benedetti
SolrMeter mate,

http://code.google.com/p/solrmeter/

Take a look, it will help you a lot !

Cheers

2015-07-21 16:49 GMT+01:00 Nagasharath :

> Any recommended tool to test the query performance would be of great help.
>
> Thanks
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Query Performance

2015-07-21 Thread Nagasharath
Any recommended tool to test the query performance would be of great help.

Thanks


Re: SolrCloud delete by query performance

2015-05-20 Thread Ryan Cutter
Shawn, thank you very much for that explanation.  It helps a lot.

Cheers, Ryan

On Wed, May 20, 2015 at 5:07 PM, Shawn Heisey  wrote:

> On 5/20/2015 5:57 PM, Ryan Cutter wrote:
> > GC is operating the way I think it should but I am lacking memory.  I am
> > just surprised because indexing is performing fine (documents going in)
> but
> > deletions are really bad (documents coming out).
> >
> > Is it possible these deletes are hitting many segments, each of which I
> > assume must be re-built?  And if there isn't much slack memory laying
> > around to begin with, there's a bunch of contention/swap?
>
> A deleteByQuery must first query the entire index to determine which IDs
> to delete.  That's going to hit every segment.  In the case of
> SolrCloud, it will also hit at least one replica of every single shard
> in the collection.
>
> If the data required to satisfy the query is not already sitting in the
> OS disk cache, then the actual disk must be read.  When RAM is extremely
> tight, any disk operation will erase relevant data out of the OS disk
> cache, so the next time it is needed, it must be read off the disk
> again.  Disks are SLOW.  What I am describing is not swap, but the
> performance impact is similar to swapping.
>
> The actual delete operation (once the IDs are known) doesn't touch any
> segments ... it writes Lucene document identifiers to a .del file, and
> that file is consulted on all queries.  Any deleted documents found in
> the query results are removed.
>
> Thanks,
> Shawn
>
>


Re: SolrCloud delete by query performance

2015-05-20 Thread Shawn Heisey
On 5/20/2015 5:57 PM, Ryan Cutter wrote:
> GC is operating the way I think it should but I am lacking memory.  I am
> just surprised because indexing is performing fine (documents going in) but
> deletions are really bad (documents coming out).
> 
> Is it possible these deletes are hitting many segments, each of which I
> assume must be re-built?  And if there isn't much slack memory laying
> around to begin with, there's a bunch of contention/swap?

A deleteByQuery must first query the entire index to determine which IDs
to delete.  That's going to hit every segment.  In the case of
SolrCloud, it will also hit at least one replica of every single shard
in the collection.

If the data required to satisfy the query is not already sitting in the
OS disk cache, then the actual disk must be read.  When RAM is extremely
tight, any disk operation will erase relevant data out of the OS disk
cache, so the next time it is needed, it must be read off the disk
again.  Disks are SLOW.  What I am describing is not swap, but the
performance impact is similar to swapping.

The actual delete operation (once the IDs are known) doesn't touch any
segments ... it writes Lucene document identifiers to a .del file, and
that file is consulted on all queries.  Any deleted documents found in
the query results are removed.

Thanks,
Shawn



Re: SolrCloud delete by query performance

2015-05-20 Thread Ryan Cutter
GC is operating the way I think it should but I am lacking memory.  I am
just surprised because indexing is performing fine (documents going in) but
deletions are really bad (documents coming out).

Is it possible these deletes are hitting many segments, each of which I
assume must be re-built?  And if there isn't much slack memory laying
around to begin with, there's a bunch of contention/swap?

Thanks Shawn!

On Wed, May 20, 2015 at 4:50 PM, Shawn Heisey  wrote:

> On 5/20/2015 5:41 PM, Ryan Cutter wrote:
> > I have a collection with 1 billion documents and I want to delete 500 of
> > them.  The collection has a dozen shards and a couple replicas.  Using
> Solr
> > 4.4.
> >
> > Sent the delete query via HTTP:
> >
> > http://hostname:8983/solr/my_collection/update?stream.body=<delete><query>source:foo</query></delete>
> >
> > Took a couple minutes and several replicas got knocked into Recovery
> mode.
> > They eventually came back and the desired docs were deleted but the
> cluster
> > wasn't thrilled (high load, etc).
> >
> > Is this expected behavior?  Is there a better way to delete documents
> that
> > I'm missing?
>
> That's the correct way to do the delete.  Before you'll see the change,
> a commit must happen in one way or another.  Hopefully you already knew
> that.
>
> I believe that your setup has some performance issues that are making it
> very slow and knocking out your Solr nodes temporarily.
>
> The most common root problems with SolrCloud and indexes going into
> recovery are:  1) Your heap is enormous but your garbage collection is
> not tuned.  2) You don't have enough RAM, separate from your Java heap,
> for adequate index caching.  With a billion documents in your
> collection, you might even be having problems with both.
>
> Here's a wiki page that includes some info on both of these problems,
> plus a few others:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
>
>


Re: SolrCloud delete by query performance

2015-05-20 Thread Shawn Heisey
On 5/20/2015 5:41 PM, Ryan Cutter wrote:
> I have a collection with 1 billion documents and I want to delete 500 of
> them.  The collection has a dozen shards and a couple replicas.  Using Solr
> 4.4.
> 
> Sent the delete query via HTTP:
> 
> http://hostname:8983/solr/my_collection/update?stream.body=<delete><query>source:foo</query></delete>
> 
> Took a couple minutes and several replicas got knocked into Recovery mode.
> They eventually came back and the desired docs were deleted but the cluster
> wasn't thrilled (high load, etc).
> 
> Is this expected behavior?  Is there a better way to delete documents that
> I'm missing?

That's the correct way to do the delete.  Before you'll see the change,
a commit must happen in one way or another.  Hopefully you already knew
that.

I believe that your setup has some performance issues that are making it
very slow and knocking out your Solr nodes temporarily.

The most common root problems with SolrCloud and indexes going into
recovery are:  1) Your heap is enormous but your garbage collection is
not tuned.  2) You don't have enough RAM, separate from your Java heap,
for adequate index caching.  With a billion documents in your
collection, you might even be having problems with both.

Here's a wiki page that includes some info on both of these problems,
plus a few others:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



SolrCloud delete by query performance

2015-05-20 Thread Ryan Cutter
I have a collection with 1 billion documents and I want to delete 500 of
them.  The collection has a dozen shards and a couple replicas.  Using Solr
4.4.

Sent the delete query via HTTP:

http://hostname:8983/solr/my_collection/update?stream.body=<delete><query>source:foo</query></delete>

Took a couple minutes and several replicas got knocked into Recovery mode.
They eventually came back and the desired docs were deleted but the cluster
wasn't thrilled (high load, etc).

Is this expected behavior?  Is there a better way to delete documents that
I'm missing?

Thanks, Ryan
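
For reference, the SolrJ equivalent of that HTTP call is roughly the following; the base URL and collection name are the ones from the example above, and the explicit commit is there only because the deletes become visible after a commit of some kind:

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class DeleteByQuerySketch {
        public static void main(String[] args) throws SolrServerException, IOException {
            HttpSolrServer server =
                    new HttpSolrServer("http://hostname:8983/solr/my_collection");

            // Same delete-by-query as the stream.body example above.
            server.deleteByQuery("source:foo");

            // The deleted documents only disappear from results after a commit.
            server.commit();
        }
    }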


Re: SolrCloud: query performance while indexing

2014-01-16 Thread Michael Della Bitta
Hi, Will,

Have you investigated not using EBS volumes at all? I'm not sure what node
size you're using, but for example, you can build a RAID 0 out of the four
instance volumes on an m1.xlarge and get lots of disk bandwidth. Also,
there's some nice SSD instances available now. http://www.ec2instances.info/

That's assuming disk throughput is your problem. Have you tried using
iostat or top to discover what your iowait% is during these merges?


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, Jan 16, 2014 at 3:08 PM, Will Butler  wrote:

> We currently have a SolrCloud cluster that contains two collections which
> we toggle between for querying and indexing. When bulk indexing to our
> “offline" collection, our query performance from the “online” collection
> suffers somewhat. When segment merges occur, it gets downright abysmal. We
> have adjusted several settings that affect flushing and/or merging and have
> tried increasing the IOPs capacity of our volumes, without much success.
> The best recommendation seems to be to simply have enough ram on each node
> for the index to fit into memory (plus additional memory which may be
> required for indexing). If this isn’t feasible, it seems that there is no
> way around the fact that flushes and merges will potentially take up IO
> resources needed for responding to queries. We are currently experimenting
> with throttling flushes and merges using maxWriteMBPerSec* settings, which
> seems to help if set to fairly low values. Does anyone have any other
> recommendations for optimizing SolrCloud to handle both heavy indexing and
> querying?
>
> Thanks,
>
> Will


SolrCloud: query performance while indexing

2014-01-16 Thread Will Butler
We currently have a SolrCloud cluster that contains two collections which we 
toggle between for querying and indexing. When bulk indexing to our “offline" 
collection, our query performance from the “online” collection suffers 
somewhat. When segment merges occur, it gets downright abysmal. We have 
adjusted several settings that affect flushing and/or merging and have tried 
increasing the IOPs capacity of our volumes, without much success. The best 
recommendation seems to be to simply have enough ram on each node for the index 
to fit into memory (plus additional memory which may be required for indexing). 
If this isn’t feasible, it seems that there is no way around the fact that 
flushes and merges will potentially take up IO resources needed for responding 
to queries. We are currently experimenting with throttling flushes and merges 
using maxWriteMBPerSec* settings, which seems to help if set to fairly low 
values. Does anyone have any other recommendations for optimizing SolrCloud to 
handle both heavy indexing and querying?

Thanks,

Will

Re: Solrj Query Performance

2013-11-28 Thread Shawn Heisey
On 11/28/2013 3:01 AM, Ahmet Arslan wrote:
> Are you sure you are using the same exact parameters? I would include 
> echoParams=all and compare the parameters. Only the wt parameter would be different:
> wt=javabin for solrJ 

You can also look at the Solr log, which if you are logging at the
normal level of INFO, will contain all parameters used on each query,
and compare the two.  There is probably some critical difference.

Thanks,
Shawn



Re: Solrj Query Performance

2013-11-28 Thread Ahmet Arslan
Hi Parsi,

Are you sure you are using the same exact parameters? I would include 
echoParams=all and compare the parameters. Only the wt parameter would be different:
wt=javabin for solrJ 
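
A small SolrJ sketch of that comparison, printing the echoed parameters plus both timings; the URL, core and query are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class EchoParamsSketch {
        public static void main(String[] args) throws SolrServerException {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("title:java");
            q.set("echoParams", "all");  // echo every parameter Solr actually applied

            QueryResponse rsp = server.query(q);
            // The echoed parameters come back inside the response header.
            System.out.println("params:  " + rsp.getResponseHeader().get("params"));
            System.out.println("QTime:   " + rsp.getQTime() + " ms (server side)");
            System.out.println("elapsed: " + rsp.getElapsedTime()
                    + " ms (including network and javabin decoding)");
        }
    }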



On Thursday, November 28, 2013 11:42 AM, Prasi S  wrote:

Hi,
We recently saw a behavior which I wanted to confirm. We are using SolrJ to
query Solr. From the code, we use HttpSolrServer to send the query and
return the response.

1. When a sample query is sent using SolrJ, we get the QTime as 4 seconds.
When we send the same query to Solr in the browser, we get it in
50 milliseconds.

Initially we thought it was because of caching.

But then, we tried the reverse way. We sent a new query to Solr in the
browser first, and got it back in milliseconds. Then we used SolrJ, and it
came to 4.5 seconds. (We take the QTime from the response object header.)

Does this have anything to do with SolrJ's internal implementation?

Thanks,
Prasi


Solrj Query Performance

2013-11-28 Thread Prasi S
Hi,
We recently saw a behavior which I wanted to confirm. We are using SolrJ to
query Solr. From the code, we use HttpSolrServer to send the query and
return the response.

1. When a sample query is sent using SolrJ, we get the QTime as 4 seconds.
When we send the same query to Solr in the browser, we get it in
50 milliseconds.

Initially we thought it was because of caching.

But then, we tried the reverse way. We sent a new query to Solr in the
browser first, and got it back in milliseconds. Then we used SolrJ, and it
came to 4.5 seconds. (We take the QTime from the response object header.)

Does this have anything to do with SolrJ's internal implementation?

Thanks,
Prasi


Re: Cross index join query performance

2013-09-30 Thread Peter Keegan
Ah, got it now - thanks for the explanation.


On Sat, Sep 28, 2013 at 3:33 AM, Upayavira  wrote:

> The thing here is to understand how a join works.
>
> Effectively, it does the inner query first, which results in a list of
> terms. It then effectively does a multi-term query with those values.
>
> q=size:large {!join fromIndex=other from=someid
> to=someotherid}type:shirt
>
> Imagine the inner join returned values A,B,C. Your inner query is, on
> core 'other', q=type:shirt&fl=someid.
>
> Then your outer query becomes size:large someotherid:(A B C)
>
> Your inner query returns 25k values. You're having to do a multi-term
> query for 25k terms. That is *bound* to be slow.
>
> The pseudo-joins in Solr 4.x are intended for a small to medium number
> of values returned by the inner query, otherwise performance degrades as
> you are seeing.
>
> Is there a way you can reduce the number of values returned by the inner
> query?
>
> As Joel mentions, those other joins are attempts to find other ways to
> work with this limitation.
>
> Upayavira
>
> On Fri, Sep 27, 2013, at 09:44 PM, Peter Keegan wrote:
> > Hi Joel,
> >
> > I tried this patch and it is quite a bit faster. Using the same query on
> > a
> > larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin'
> > QTime was 100 msec! This was for true for large and small result sets.
> >
> > A few notes: the patch didn't compile with 4.3 because of the
> > SolrCore.getLatestSchema call (which I worked around), and the package
> > name
> > should be:
> >  > class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>
> >
> > Unfortunately, I just learned that our uniqueKey may have to be an
> > alphanumeric string instead of an int, so I'm not out of the woods yet.
> >
> > Good stuff - thanks.
> >
> > Peter
> >
> >
> > On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein 
> > wrote:
> >
> > > It looks like you are using int join keys so you may want to check out
> > > SOLR-4787, specifically the hjoin and bjoin.
> > >
> > > These perform well when you have a large number of results from the
> > > fromIndex. If you have a small number of results in the fromIndex the
> > > standard join will be faster.
> > >
> > >
> > > On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan  > > >wrote:
> > >
> > > > I forgot to mention - this is Solr 4.3
> > > >
> > > > Peter
> > > >
> > > >
> > > >
> > > > On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan <
> peterlkee...@gmail.com
> > > > >wrote:
> > > >
> > > > > I'm doing a cross-core join query and the join query is 30X slower
> than
> > > > > each of the 2 individual queries. Here are the queries:
> > > > >
> > > > > Main query:
> http://localhost:8983/solr/mainindex/select?q=title:java
> > > > > QTime: 5 msec
> > > > > hit count: 1000
> > > > >
> > > > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1TO
> > > > 0.3]
> > > > > QTime: 4 msec
> > > > > hit count: 25K
> > > > >
> > > > > Join query:
> > > > >
> > > >
> > >
> http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindextoIndex=subindexfrom=docidto=docid}fld1:[0.1
>  TO 0.3]
> > > > > QTime: 160 msec
> > > > > hit count: 205
> > > > >
> > > > > Here are the index spec's:
> > > > >
> > > > > mainindex size: 117K docs, 1 segment
> > > > > mainindex schema:
> > > > > > > > > required="true" multiValued="false" />
> > > > > > > > > stored="true" multiValued="false" />
> > > > >docid
> > > > >
> > > > > subindex size: 117K docs, 1 segment
> > > > > subindex schema:
> > > > > > > > > required="true" multiValued="false" />
> > > > > > > > > required="false" multiValued="false" />
> > > > >docid
> > > > >
> > > > > With debugQuery=true I see:
> > > > >   "debug":{
> > > > > "join":{
> > > > >   "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO
> > > 0.3]":{
> > > > > "time":155,
> > > > > "fromSetSize":24742,
> > > > > "toSetSize":24742,
> > > > > "fromTermCount":117810,
> > > > > "fromTermTotalDf":117810,
> > > > > "fromTermDirectCount":117810,
> > > > > "fromTermHits":24742,
> > > > > "fromTermHitsTotalDf":24742,
> > > > > "toTermHits":24742,
> > > > > "toTermHitsTotalDf":24742,
> > > > > "toTermDirectCount":24627,
> > > > > "smallSetsDeferred":115,
> > > > > "toSetDocsAdded":24742}},
> > > > >
> > > > > Via profiler and debugger, I see 150 msec spent in the outer
> > > > > 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This
> seems
> > > > like a
> > > > > lot of time to join the bitsets. Does this seem right?
> > > > >
> > > > > Peter
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Joel Bernstein
> > > Professional Services LucidWorks
> > >
>


Re: Cross index join query performance

2013-09-28 Thread Upayavira
The thing here is to understand how a join works.

Effectively, it does the inner query first, which results in a list of
terms. It then effectively does a multi-term query with those values.

q=size:large {!join fromIndex=other from=someid
to=someotherid}type:shirt

Imagine the inner join returned values A,B,C. Your inner query is, on
core 'other', q=type:shirt&fl=someid.

Then your outer query becomes size:large someotherid:(A B C)

Your inner query returns 25k values. You're having to do a multi-term
query for 25k terms. That is *bound* to be slow.

The pseudo-joins in Solr 4.x are intended for a small to medium number
of values returned by the inner query, otherwise performance degrades as
you are seeing.

Is there a way you can reduce the number of values returned by the inner
query?

As Joel mentions, those other joins are attempts to find other ways to
work with this limitation.

Upayavira
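
For illustration, the join from this thread issued through SolrJ; the core and field names are the ones quoted below, and the fromIndex form here matches the debug output rather than the longer URL:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class JoinQuerySketch {
        public static void main(String[] args) throws SolrServerException {
            // Query the main core; the join filter is evaluated against the sub core.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mainindex");

            SolrQuery q = new SolrQuery("title:java");
            // Inner query runs on subindex; its docid values then filter mainindex.
            q.addFilterQuery("{!join fromIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3]");

            System.out.println("hits: " + server.query(q).getResults().getNumFound());
        }
    }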

On Fri, Sep 27, 2013, at 09:44 PM, Peter Keegan wrote:
> Hi Joel,
> 
> I tried this patch and it is quite a bit faster. Using the same query on
> a
> larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin'
> QTime was 100 msec! This was for true for large and small result sets.
> 
> A few notes: the patch didn't compile with 4.3 because of the
> SolrCore.getLatestSchema call (which I worked around), and the package
> name
> should be:
>  class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>
> 
> Unfortunately, I just learned that our uniqueKey may have to be an
> alphanumeric string instead of an int, so I'm not out of the woods yet.
> 
> Good stuff - thanks.
> 
> Peter
> 
> 
> On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein 
> wrote:
> 
> > It looks like you are using int join keys so you may want to check out
> > SOLR-4787, specifically the hjoin and bjoin.
> >
> > These perform well when you have a large number of results from the
> > fromIndex. If you have a small number of results in the fromIndex the
> > standard join will be faster.
> >
> >
> > On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan  > >wrote:
> >
> > > I forgot to mention - this is Solr 4.3
> > >
> > > Peter
> > >
> > >
> > >
> > > On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan  > > >wrote:
> > >
> > > > I'm doing a cross-core join query and the join query is 30X slower than
> > > > each of the 2 individual queries. Here are the queries:
> > > >
> > > > Main query: http://localhost:8983/solr/mainindex/select?q=title:java
> > > > QTime: 5 msec
> > > > hit count: 1000
> > > >
> > > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO
> > > 0.3]
> > > > QTime: 4 msec
> > > > hit count: 25K
> > > >
> > > > Join query:
> > > >
> > >
> > http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindextoIndex=subindexfrom=docid
> >  to=docid}fld1:[0.1 TO 0.3]
> > > > QTime: 160 msec
> > > > hit count: 205
> > > >
> > > > Here are the index spec's:
> > > >
> > > > mainindex size: 117K docs, 1 segment
> > > > mainindex schema:
> > > > > > > required="true" multiValued="false" />
> > > > > > > stored="true" multiValued="false" />
> > > >docid
> > > >
> > > > subindex size: 117K docs, 1 segment
> > > > subindex schema:
> > > > > > > required="true" multiValued="false" />
> > > > > > > required="false" multiValued="false" />
> > > >docid
> > > >
> > > > With debugQuery=true I see:
> > > >   "debug":{
> > > > "join":{
> > > >   "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO
> > 0.3]":{
> > > > "time":155,
> > > > "fromSetSize":24742,
> > > > "toSetSize":24742,
> > > > "fromTermCount":117810,
> > > > "fromTermTotalDf":117810,
> > > > "fromTermDirectCount":117810,
> > > > "fromTermHits":24742,
> > > > "fromTermHitsTotalDf":24742,
> > > > "toTermHits":24742,
> > > > "toTermHitsTotalDf":24742,
> > > > "toTermDirectCount":24627,
> > > > "smallSetsDeferred":115,
> > > > "toSetDocsAdded":24742}},
> > > >
> > > > Via profiler and debugger, I see 150 msec spent in the outer
> > > > 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems
> > > like a
> > > > lot of time to join the bitsets. Does this seem right?
> > > >
> > > > Peter
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Joel Bernstein
> > Professional Services LucidWorks
> >


Re: Cross index join query performance

2013-09-27 Thread Peter Keegan
Hi Joel,

I tried this patch and it is quite a bit faster. Using the same query on a
larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin'
QTime was 100 msec! This was for true for large and small result sets.

A few notes: the patch didn't compile with 4.3 because of the
SolrCore.getLatestSchema call (which I worked around), and the package name
should be:


Unfortunately, I just learned that our uniqueKey may have to be an
alphanumeric string instead of an int, so I'm not out of the woods yet.

Good stuff - thanks.

Peter


On Thu, Sep 26, 2013 at 6:49 PM, Joel Bernstein  wrote:

> It looks like you are using int join keys so you may want to check out
> SOLR-4787, specifically the hjoin and bjoin.
>
> These perform well when you have a large number of results from the
> fromIndex. If you have a small number of results in the fromIndex the
> standard join will be faster.
>
>
> On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan  >wrote:
>
> > I forgot to mention - this is Solr 4.3
> >
> > Peter
> >
> >
> >
> > On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan  > >wrote:
> >
> > > I'm doing a cross-core join query and the join query is 30X slower than
> > > each of the 2 individual queries. Here are the queries:
> > >
> > > Main query: http://localhost:8983/solr/mainindex/select?q=title:java
> > > QTime: 5 msec
> > > hit count: 1000
> > >
> > > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO
> > 0.3]
> > > QTime: 4 msec
> > > hit count: 25K
> > >
> > > Join query:
> > >
> >
> http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindextoIndex=subindexfrom=docid
>  to=docid}fld1:[0.1 TO 0.3]
> > > QTime: 160 msec
> > > hit count: 205
> > >
> > > Here are the index spec's:
> > >
> > > mainindex size: 117K docs, 1 segment
> > > mainindex schema:
> > > > > required="true" multiValued="false" />
> > > > > stored="true" multiValued="false" />
> > >docid
> > >
> > > subindex size: 117K docs, 1 segment
> > > subindex schema:
> > > > > required="true" multiValued="false" />
> > > > > required="false" multiValued="false" />
> > >docid
> > >
> > > With debugQuery=true I see:
> > >   "debug":{
> > > "join":{
> > >   "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO
> 0.3]":{
> > > "time":155,
> > > "fromSetSize":24742,
> > > "toSetSize":24742,
> > > "fromTermCount":117810,
> > > "fromTermTotalDf":117810,
> > > "fromTermDirectCount":117810,
> > > "fromTermHits":24742,
> > > "fromTermHitsTotalDf":24742,
> > > "toTermHits":24742,
> > > "toTermHitsTotalDf":24742,
> > > "toTermDirectCount":24627,
> > > "smallSetsDeferred":115,
> > > "toSetDocsAdded":24742}},
> > >
> > > Via profiler and debugger, I see 150 msec spent in the outer
> > > 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems
> > like a
> > > lot of time to join the bitsets. Does this seem right?
> > >
> > > Peter
> > >
> > >
> >
>
>
>
> --
> Joel Bernstein
> Professional Services LucidWorks
>


Re: Cross index join query performance

2013-09-26 Thread Joel Bernstein
It looks like you are using int join keys so you may want to check out
SOLR-4787, specifically the hjoin and bjoin.

These perform well when you have a large number of results from the
fromIndex. If you have a small number of results in the fromIndex the
standard join will be faster.


On Wed, Sep 25, 2013 at 3:39 PM, Peter Keegan wrote:

> I forgot to mention - this is Solr 4.3
>
> Peter
>
>
>
> On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan  >wrote:
>
> > I'm doing a cross-core join query and the join query is 30X slower than
> > each of the 2 individual queries. Here are the queries:
> >
> > Main query: http://localhost:8983/solr/mainindex/select?q=title:java
> > QTime: 5 msec
> > hit count: 1000
> >
> > Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO
> 0.3]
> > QTime: 4 msec
> > hit count: 25K
> >
> > Join query:
> >
> http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindextoIndex=subindex
>  from=docid to=docid}fld1:[0.1 TO 0.3]
> > QTime: 160 msec
> > hit count: 205
> >
> > Here are the index spec's:
> >
> > mainindex size: 117K docs, 1 segment
> > mainindex schema:
> > > required="true" multiValued="false" />
> > > stored="true" multiValued="false" />
> >docid
> >
> > subindex size: 117K docs, 1 segment
> > subindex schema:
> > > required="true" multiValued="false" />
> > > required="false" multiValued="false" />
> >docid
> >
> > With debugQuery=true I see:
> >   "debug":{
> > "join":{
> >   "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{
> > "time":155,
> > "fromSetSize":24742,
> > "toSetSize":24742,
> > "fromTermCount":117810,
> > "fromTermTotalDf":117810,
> > "fromTermDirectCount":117810,
> > "fromTermHits":24742,
> > "fromTermHitsTotalDf":24742,
> > "toTermHits":24742,
> > "toTermHitsTotalDf":24742,
> > "toTermDirectCount":24627,
> > "smallSetsDeferred":115,
> > "toSetDocsAdded":24742}},
> >
> > Via profiler and debugger, I see 150 msec spent in the outer
> > 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems
> like a
> > lot of time to join the bitsets. Does this seem right?
> >
> > Peter
> >
> >
>



-- 
Joel Bernstein
Professional Services LucidWorks


Re: Cross index join query performance

2013-09-25 Thread Peter Keegan
I forgot to mention - this is Solr 4.3

Peter



On Wed, Sep 25, 2013 at 3:38 PM, Peter Keegan wrote:

> I'm doing a cross-core join query and the join query is 30X slower than
> each of the 2 individual queries. Here are the queries:
>
> Main query: http://localhost:8983/solr/mainindex/select?q=title:java
> QTime: 5 msec
> hit count: 1000
>
> Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3]
> QTime: 4 msec
> hit count: 25K
>
> Join query:
> http://localhost:8983/solr/mainindex/select?q=title:java&fq={!joinfromIndex=mainindex
>  toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3]
> QTime: 160 msec
> hit count: 205
>
> Here are the index spec's:
>
> mainindex size: 117K docs, 1 segment
> mainindex schema:
> required="true" multiValued="false" />
> stored="true" multiValued="false" />
>docid
>
> subindex size: 117K docs, 1 segment
> subindex schema:
> required="true" multiValued="false" />
> required="false" multiValued="false" />
>docid
>
> With debugQuery=true I see:
>   "debug":{
> "join":{
>   "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{
> "time":155,
> "fromSetSize":24742,
> "toSetSize":24742,
> "fromTermCount":117810,
> "fromTermTotalDf":117810,
> "fromTermDirectCount":117810,
> "fromTermHits":24742,
> "fromTermHitsTotalDf":24742,
> "toTermHits":24742,
> "toTermHitsTotalDf":24742,
> "toTermDirectCount":24627,
> "smallSetsDeferred":115,
> "toSetDocsAdded":24742}},
>
> Via profiler and debugger, I see 150 msec spent in the outer
> 'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems like a
> lot of time to join the bitsets. Does this seem right?
>
> Peter
>
>


Cross index join query performance

2013-09-25 Thread Peter Keegan
I'm doing a cross-core join query and the join query is 30X slower than
each of the 2 individual queries. Here are the queries:

Main query: http://localhost:8983/solr/mainindex/select?q=title:java
QTime: 5 msec
hit count: 1000

Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3]
QTime: 4 msec
hit count: 25K

Join query:
http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join fromIndex=mainindex
toIndex=subindex from=docid to=docid}fld1:[0.1 TO 0.3]
QTime: 160 msec
hit count: 205

Here are the index spec's:

mainindex size: 117K docs, 1 segment
mainindex schema:
   
   
   docid

subindex size: 117K docs, 1 segment
subindex schema:
   
   
   docid

With debugQuery=true I see:
  "debug":{
"join":{
  "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{
"time":155,
"fromSetSize":24742,
"toSetSize":24742,
"fromTermCount":117810,
"fromTermTotalDf":117810,
"fromTermDirectCount":117810,
"fromTermHits":24742,
"fromTermHitsTotalDf":24742,
"toTermHits":24742,
"toTermHitsTotalDf":24742,
"toTermDirectCount":24627,
"smallSetsDeferred":115,
"toSetDocsAdded":24742}},

Via profiler and debugger, I see 150 msec spent in the outer
'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems like a
lot of time to join the bitsets. Does this seem right?

Peter


Re: Solr4 update and query performance question

2013-08-15 Thread Erick Erickson
bq: There is no batching while updating/inserting documents in Solr3

Correct, but all the updates only went to the server you targeted them for.
The batching you're seeing is the auto-distributing the docs to the various
shards, a whole different animal.

Keep an eye on: https://issues.apache.org/jira/browse/SOLR-4816. You might
prompt Joel to see if this is testable. This JIRA routes the docs directly
to the leader of the shard they should go to. IOW it does the routing on
the client side. There will still be batching from the leader to the
replicas, but this should help.

It is usually a Bad Thing to commit after every batch either in Solr 3 or
Solr 4 from the client. I suspect you're right that the wait for all the
searchers on all the shards is one of your problems. Try configuring
autocommit (both hard and soft) in solrconfig.xml and forgetting the commit
bits from the client. This is the usual pattern in Solr4.

Your soft commit (which may be commented out) controls when the documents
are searchable. It is less expensive than hard commits with
openSearcher=true and makes docs visible. Hard commit closes the current
segment and opens a new one. So set up openSearcher=false for your hard
commit and a soft commit interval of whatever latency you can stand would
by my recommendation.

Final note: if you set your hard commit with openSearcher=false, do it
fairly often since it truncates the transaction logs and is quite
inexpensive. If you let your tlog grow huge and then kill your server and
re-start Solr, you get into a situation where Solr may replay the tlog. If
it has a bazillion docs in it, that can take a very long time to start up.

Best
Erick
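
For illustration, the client-side knobs being discussed here (waitSearcher and softCommit) as they look from SolrJ; the URL, core name and documents are placeholders, and with autoCommit/autoSoftCommit configured in solrconfig.xml the explicit commit call can be dropped from the client entirely:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitOptionsSketch {
        public static void main(String[] args) throws SolrServerException, IOException {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                batch.add(doc);
            }
            server.add(batch);

            // commit(waitFlush, waitSearcher, softCommit):
            // waitSearcher=false returns before the new searcher is registered,
            // softCommit=true makes the batch visible without flushing segments.
            server.commit(false, false, true);
        }
    }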




On Wed, Aug 14, 2013 at 4:39 PM, Joshi, Shital  wrote:

> We didn't copy/paste Solr3 config to solr4. We started with Solr4 config
> and only updated new searcher queries and few other things.
>
> There is no batching while updating/inserting documents in Solr3, is that
> correct? Committing 1000 documents in Solr3 takes 19 seconds while in Solr4
> it takes about 3-4 minutes. We noticed in Solr4 logs that commit only
> returns after the new searcher is created across all nodes. This is possibly
> because waitSearcher=true by default in Solr4. This was not the case with
> Solr3; commit would return without waiting for new searcher creation.
>
> In order to improve performance with Solr4, we first changed from
> commit=true to commit=false in update URL and added autoHardCommit setting
> in solrconfig.xml. This improved performance from 3-4 minutes to 1-2
> minutes but that is not good enough.
>
> Then we changed maxBufferedAddsPerServer value in SolrCmdDistributor class
> from 10 to 1000 and deployed this class in
> $JETTY_TEMP_FOLDER/solr-webapp/webapp/WEB-INF/classes folder and restarted
> solr4 nodes. But we still see the batch size of 10 being used. Did we
> change correct variable/class?
>
> Next thing We will try using softCommit=true in update url and check if it
> gives us desired performance.
>
> Thanks for looking into this. Appreciate your help.
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, August 13, 2013 8:12 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr4 update and query performance question
>
> 1> That's hard-coded at present. There's anecdotal evidence that there
>  are throughput improvements with larger batch sizes, but no action
>  yet.
> 2> Yep, all searchers are also re-opened, caches re-warmed, etc.
> 3> Odd. I'm assuming your Solr3 was master/slave setup? Seeing the
> queries would help diagnose this. Also, did you try to copy/paste
> the configuration from your Solr3 to Solr4? I'd start with the
> Solr4 and copy/paste only the parts needed from your SOlr3 setup.
>
> Best
> Erick
>
>
> On Mon, Aug 12, 2013 at 11:38 AM, Joshi, Shital 
> wrote:
>
> > Hi,
> >
> > We have SolrCloud (4.4.0) cluster (5 shards and 2 replicas) on 10 boxes
> > with about 450 mil documents (~90 mil per shard). We're loading 1000 or
> > less documents in CSV format every few minutes. In Solr3, with 300 mil
> > documents, it used to take 30 seconds to load 1000 documents while in
> > Solr4, its taking up to 3 minutes to load 1000 documents. We're using
> > custom sharding, we include _shard_=shardid parameter in update command.
> > Upon looking Solr4 log files we found that:
> >
> > 1.   Documents are added in a batch of 10 records. How do we increase
> > this batch size from 10 to 1000 documents?
> >
> > 2.  We do hard commit after loading 1000 documents. For every hard
> > commit, it refreshes searcher on all nodes. Are all caches also refreshed
> > when hard commit happens? 

RE: Solr4 update and query performance question

2013-08-14 Thread Joshi, Shital
We didn't copy/paste the Solr3 config to Solr4. We started with the Solr4
config and only updated new searcher queries and a few other things.

There is no batching while updating/inserting documents in Solr3, is that 
correct? Committing 1000 documents in Solr3 takes 19 seconds while in Solr4 it 
takes about 3-4 minutes. We noticed in the Solr4 logs that commit only returns
after a new searcher is created across all nodes. This is possibly because
waitSearcher=true by default in Solr4. This was not the case with Solr3, where
commit would return without waiting for new searcher creation.

In order to improve performance with Solr4, we first changed from commit=true 
to commit=false in update URL and added autoHardCommit setting in 
solrconfig.xml. This improved performance from 3-4 minutes to 1-2 minutes but 
that is not good enough. 

Then we changed the maxBufferedAddsPerServer value in the SolrCmdDistributor
class from 10 to 1000, deployed this class in the
$JETTY_TEMP_FOLDER/solr-webapp/webapp/WEB-INF/classes folder, and restarted the
solr4 nodes. But we still see the batch size of 10 being used. Did we change
the correct variable/class?

Next we will try using softCommit=true in the update URL and check if it
gives us the desired performance.
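
For reference, a sketch of the update request variants being compared in this
thread (the collection name is a placeholder):

   /solr/collection1/update?commit=true      <- hard commit per batch (the slow,
                                                original approach)
   /solr/collection1/update?commit=false     <- no explicit commit; rely on the
                                                autocommit settings in solrconfig.xml
   /solr/collection1/update?softCommit=true  <- soft commit only: makes docs
                                                visible without the cost of a
                                                full hard commit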

Thanks for looking into this. Appreciate your help. 

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, August 13, 2013 8:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr4 update and query performance question

1> That's hard-coded at present. There's anecdotal evidence that there
 are throughput improvements with larger batch sizes, but no action
 yet.
2> Yep, all searchers are also re-opened, caches re-warmed, etc.
3> Odd. I'm assuming your Solr3 was master/slave setup? Seeing the
queries would help diagnose this. Also, did you try to copy/paste
the configuration from your Solr3 to Solr4? I'd start with the
Solr4 and copy/paste only the parts needed from your SOlr3 setup.

Best
Erick


On Mon, Aug 12, 2013 at 11:38 AM, Joshi, Shital  wrote:

> Hi,
>
> We have SolrCloud (4.4.0) cluster (5 shards and 2 replicas) on 10 boxes
> with about 450 mil documents (~90 mil per shard). We're loading 1000 or
> less documents in CSV format every few minutes. In Solr3, with 300 mil
> documents, it used to take 30 seconds to load 1000 documents while in
> Solr4, its taking up to 3 minutes to load 1000 documents. We're using
> custom sharding, we include _shard_=shardid parameter in update command.
> Upon looking Solr4 log files we found that:
>
> 1.   Documents are added in a batch of 10 records. How do we increase
> this batch size from 10 to 1000 documents?
>
> 2.  We do hard commit after loading 1000 documents. For every hard
> commit, it refreshes searcher on all nodes. Are all caches also refreshed
> when hard commit happens? We're planning to change to soft commit and do
> auto hard commit every 10-15 minutes.
>
> 3.  We're not seeing improved query performance compared to Solr3.
> Queries which took 3-5 seconds in Solr3 (300 mil docs) are taking 20
> seconds with Solr4. We think this could be due to frequent hard commits and
> searcher refresh. Do you think when we change to soft commit and increase
> the batch size, we will see better query performance.
>
> Thanks!
>
>
>


Re: Solr4 update and query performance question

2013-08-13 Thread Erick Erickson
1> That's hard-coded at present. There's anecdotal evidence that there
 are throughput improvements with larger batch sizes, but no action
 yet.
2> Yep, all searchers are also re-opened, caches re-warmed, etc.
3> Odd. I'm assuming your Solr3 was master/slave setup? Seeing the
queries would help diagnose this. Also, did you try to copy/paste
the configuration from your Solr3 to Solr4? I'd start with the
Solr4 and copy/paste only the parts needed from your Solr3 setup.

Best
Erick


On Mon, Aug 12, 2013 at 11:38 AM, Joshi, Shital  wrote:

> Hi,
>
> We have SolrCloud (4.4.0) cluster (5 shards and 2 replicas) on 10 boxes
> with about 450 mil documents (~90 mil per shard). We're loading 1000 or
> less documents in CSV format every few minutes. In Solr3, with 300 mil
> documents, it used to take 30 seconds to load 1000 documents while in
> Solr4, its taking up to 3 minutes to load 1000 documents. We're using
> custom sharding, we include _shard_=shardid parameter in update command.
> Upon looking Solr4 log files we found that:
>
> 1.   Documents are added in a batch of 10 records. How do we increase
> this batch size from 10 to 1000 documents?
>
> 2.  We do hard commit after loading 1000 documents. For every hard
> commit, it refreshes searcher on all nodes. Are all caches also refreshed
> when hard commit happens? We're planning to change to soft commit and do
> auto hard commit every 10-15 minutes.
>
> 3.  We're not seeing improved query performance compared to Solr3.
> Queries which took 3-5 seconds in Solr3 (300 mil docs) are taking 20
> seconds with Solr4. We think this could be due to frequent hard commits and
> searcher refresh. Do you think when we change to soft commit and increase
> the batch size, we will see better query performance.
>
> Thanks!
>
>
>


Solr4 update and query performance question

2013-08-12 Thread Joshi, Shital
Hi,

We have a SolrCloud (4.4.0) cluster (5 shards and 2 replicas) on 10 boxes with 
about 450 mil documents (~90 mil per shard). We're loading 1000 or fewer 
documents in CSV format every few minutes. In Solr3, with 300 mil documents, it 
used to take 30 seconds to load 1000 documents, while in Solr4 it's taking up to 
3 minutes to load 1000 documents. We're using custom sharding; we include the 
_shard_=shardid parameter in the update command. Upon looking at the Solr4 log 
files we found that:

1.   Documents are added in a batch of 10 records. How do we increase this 
batch size from 10 to 1000 documents?

2.  We do a hard commit after loading 1000 documents. Every hard commit 
refreshes the searcher on all nodes. Are all caches also refreshed when a hard 
commit happens? We're planning to change to soft commits and do an auto hard 
commit every 10-15 minutes.

3.  We're not seeing improved query performance compared to Solr3. Queries 
which took 3-5 seconds in Solr3 (300 mil docs) are taking 20 seconds with 
Solr4. We think this could be due to frequent hard commits and searcher 
refreshes. Do you think we will see better query performance when we change to 
soft commits and increase the batch size?

Thanks!




Re: Query Performance

2013-07-28 Thread Jack Krupansky

start is a window into the sorted, matched documents.

So, whether the second query matches far fewer documents, and hence has 
less to sort, depends once again on where X lies in the distribution of 
documents. If X is the first term in the field, the second query would match 
all documents (except the first, since you used "{" rather than "["). 
But the query itself might be slower than a *:* query, depending on exactly 
how Lucene evaluates range queries.
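
For illustration (the field name id is an assumption here, matching the sort
field used in the queries above):

   q=id:{X TO *}&sort=id asc&start=0   <- exclusive lower bound: the document
                                          whose term equals X is skipped
   q=id:[X TO *]&sort=id asc&start=0   <- inclusive lower bound: that document
                                          matches as well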


-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Sunday, July 28, 2013 5:34 PM
To: solr-user@lucene.apache.org
Subject: Re: Query Performance

Actually I have to rewrite my question:

Query 1:

q=*:*&rows=row_count&sort=id asc&start=X

and

Query2:

q={X TO *}&rows=row_count&sort=id asc&start=0



2013/7/29 Jack Krupansky 


The second query excludes documents matched by [* TO X], while the first
query matches all documents.

Relative performance will depend on relative match count and the sort time
on the matched documents. Sorting will likely be the dominant factor - for
equal number of documents. So, it depends on whether starting with X
excludes or includes the majority of documents, relative to whatever
row_count might be.

Generally, you should only sort a small number of documents/results.

Or, consider DocValues since they are designed for sorting.

-- Jack Krupansky

-Original Message- From: Furkan KAMACI
Sent: Sunday, July 28, 2013 5:06 PM
To: solr-user@lucene.apache.org
Subject: Query Performance


What is the difference between:

q=*:*&rows=row_count&sort=id asc

and

q={X TO *}&rows=row_count&sort=id asc

Does the first one try to get all the documents and then cut the result, or
are they the same, or...? What happens in the underlying process of Solr for
those two queries?





Re: Query Performance

2013-07-28 Thread Furkan KAMACI
Actually I have to rewrite my question:

Query 1:

q=*:*&rows=row_count&sort=id asc&start=X

and

Query2:

q={X TO *}&rows=row_count&sort=id asc&start=0



2013/7/29 Jack Krupansky 

> The second query excludes documents matched by [* TO X], while the first
> query matches all documents.
>
> Relative performance will depend on relative match count and the sort time
> on the matched documents. Sorting will likely be the dominant factor - for
> equal number of documents. So, it depends on whether starting with X
> excludes or includes the majority of documents, relative to whatever
> row_count might be.
>
> Generally, you should only sort a small number of documents/results.
>
> Or, consider DocValues since they are designed for sorting.
>
> -- Jack Krupansky
>
> -Original Message- From: Furkan KAMACI
> Sent: Sunday, July 28, 2013 5:06 PM
> To: solr-user@lucene.apache.org
> Subject: Query Performance
>
>
> What is the difference between:
>
> q=*:*&rows=row_count&sort=id asc
>
> and
>
> q={X TO *}&rows=row_count&sort=id asc
>
> Does the first one trys to get all the documents but cut the result or they
> are same or...? What happens at underlying process of Solr for that two
> queries?
>


Re: Query Performance

2013-07-28 Thread Jack Krupansky
The second query excludes documents matched by [* TO X], while the first 
query matches all documents.


Relative performance will depend on relative match count and the sort time 
on the matched documents. Sorting will likely be the dominant factor for an 
equal number of documents. So, it depends on whether starting with X 
excludes or includes the majority of documents, relative to whatever 
row_count might be.


Generally, you should only sort a small number of documents/results.

Or, consider DocValues since they are designed for sorting.
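
As a sketch, enabling DocValues on the sort field in schema.xml looks roughly
like this (the field and type names are assumptions; a full reindex is needed
after the change):

   <field name="id" type="string" indexed="true" stored="true" docValues="true"/>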

-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Sunday, July 28, 2013 5:06 PM
To: solr-user@lucene.apache.org
Subject: Query Performance

What is the difference between:

q=*:*&rows=row_count&sort=id asc

and

q={X TO *}&rows=row_count&sort=id asc

Does the first one try to get all the documents and then cut the result, or they
are the same, or...? What happens in the underlying process of Solr for those two
queries? 



Query Performance

2013-07-28 Thread Furkan KAMACI
What is the difference between:

q=*:*&rows=row_count&sort=id asc

and

q={X TO *}&rows=row_count&sort=id asc

Does the first one try to get all the documents and then cut the result, or they
are the same, or...? What happens in the underlying process of Solr for those two
queries?


Re: How to improve the Solr "OR" query performance

2013-07-03 Thread Otis Gospodnetic
Hi,

Does that OR query need to be scored?
Does it repeat?
If the answers are no and yes, you should use fq, not q.
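
A sketch of the difference (the field and terms here are made up):

   q=field:(A OR B OR C)          <- the OR clause is scored for every match
   q=*:*&fq=field:(A OR B OR C)   <- the OR clause is a filter: no scoring,
                                     and the result is cached in the
                                     filterCache for the repeats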

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Wed, Jul 3, 2013 at 12:07 PM, Kevin Osborn  wrote:
> Also, what is the total document count for your result set? We have an
> application that is also very slow because it does a lot of OR queries. The
> problem is that the result set is very large because of the ORs. Profiling
> showed that Solr was spending the bulk of its time scoring the documents.
>
> Also, instead of OR, you may want to look at dismax or edismax. For search
> box type applications, OR is not really what you want. It just seems like
> what you want.
>
> -Kevin
>
>
> On Wed, Jul 3, 2013 at 5:10 AM, Toke Eskildsen 
> wrote:
>
>> On Wed, 2013-07-03 at 05:48 +0200, huasanyelao wrote:
>> > The response time for the "OR" query is around 1-2 seconds (the "AND" query is
>> just about 30ms-40ms ).
>>
>> The number of hits will also be much lower for the AND-query. To check
>> whether it is the OR or the size of the result set that is the problem,
>> please try and construct an AND-based query that hits about as many
>> documents as your slow OR query.
>>
>> With an index size of just 9GB, I am surprised that you use sharding.
>> Have you tried using just a single instance to avoid the merge-overhead?
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>>
>
>
> --
> *KEVIN OSBORN*
> LEAD SOFTWARE ENGINEER
> CNET Content Solutions
> OFFICE 949.399.8714
> CELL 949.310.4677  SKYPE osbornk
> 5 Park Plaza, Suite 600, Irvine, CA 92614
> [image: CNET Content Solutions]

