Re: Is it a good query performance with this data size ?

2015-08-19 Thread wwang525
Hi Erick,

All my queries are based on fq (filter queries). I have to send randomly
generated queries to warm up the low-level Lucene caches.

I took the more tedious route of warming up the low-level caches without
the three Solr caches by turning them off (setting their sizes to zero). Then
I sent 800 randomly generated requests to Solr. RAM usage jumped from 500 MB
to 2.5 GB and stayed there.

Then I tested individual queries against Solr. This time, the response times
were very close whether I issued a request the first, second, or third time.

The results: 

(1) Average response time: 803 ms, with only one request having a response
time over 1 second (1042 ms).
(2) The majority of the time was spent on the query, not on faceting
(730/803 ≈ 90%).

So the query is the bottleneck.

I also have an interesting finding: it looks like fq works better with
integer-typed fields. I had created string-typed fields for two properties,
DateDep and Duration, because setting docValues=true on the integer type did
not work with my faceted search. At one point I accidentally ran a filter
query against one of the string-typed properties and found that query
performance degraded quite a lot.

Is it generally true that fq works better with integer types?

If this is the case, I could create two additional integer-typed properties
to use for those two fq clauses and check whether that boosts performance
(see the sketch below).
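A minimal schema.xml sketch of what that might look like, assuming a Trie
integer field type (e.g. tint) is available and using hypothetical field
names; the real schema may differ:

    <!-- hypothetical integer copies of the two filter fields -->
    <field name="DateDep_i"  type="tint" indexed="true" stored="false"/>
    <field name="Duration_i" type="tint" indexed="true" stored="false"/>

Filter queries would then go against the integer copies, e.g.
fq=DateDep_i:20150901&fq=Duration_i:7, while the existing string fields stay
in place for faceting.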

Thanks





Re: Is it a good query performance with this data size ?

2015-08-19 Thread wwang525
Hi Upayavira,

Thank you very much for pointing out the potential design issue.

The queries will be determined through a configuration maintained by business users.
There will be a limited number of queries every day, and they will be executed by
customers repeatedly. However, business users will change the configuration, so new
queries will be generated, though they will also be limited in number. The change
can be as frequent as daily or weekly. The project is to support daily promotions
based on fresh index data.

Cumulatively, there can be a lot of different queries. If I still want to
take advantage of the filterCache, can I limit the size of the three
caches so that the RAM usage stays under control?

Thanks





Re: Is it a good query performance with this data size ?

2015-08-19 Thread Erick Erickson
bq:  can I limit the size of the three
caches so that the RAM usage will be under control

That's exactly what the size parameter is for.

As Upayavira says, the rough size of each entry in
the filterCache is maxDocs/8 + (sizeof query string).

queryResultCache entries are much smaller; each is roughly
(sizeof entire query) + ((sizeof Java int) * queryResultWindowSize).

queryResultWindowSize comes from solrconfig.xml. The point
here is that this is rarely very big unless you make the
queryResultCache huge.

As for the documentCache, it's also usually not
very large; it's (the size you declare) * (average size of a doc).
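A sketch of how those size limits might look in solrconfig.xml; the numbers
are purely illustrative, not recommendations:

    <!-- illustrative cache settings; tune the sizes to your own RAM budget -->
    <filterCache      class="solr.FastLRUCache" size="512"  initialSize="512"  autowarmCount="64"/>
    <queryResultCache class="solr.LRUCache"     size="1024" initialSize="1024" autowarmCount="32"/>
    <documentCache    class="solr.LRUCache"     size="1024" initialSize="1024" autowarmCount="0"/>
    <queryResultWindowSize>20</queryResultWindowSize>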

Best,
Erick

On Wed, Aug 19, 2015 at 9:12 AM, wwang525 wwang...@gmail.com wrote:
 Hi Upayavira,

 Thank you very much for pointing out the potential design issue.

 The queries will be determined through a configuration maintained by business users.
 There will be a limited number of queries every day, and they will be executed by
 customers repeatedly. However, business users will change the configuration, so new
 queries will be generated, though they will also be limited in number. The change
 can be as frequent as daily or weekly. The project is to support daily promotions
 based on fresh index data.

 Cumulatively, there can be a lot of different queries. If I still want to
 take advantage of the filterCache, can I limit the size of the three
 caches so that the RAM usage stays under control?

 Thanks





Re: Is it a good query performance with this data size ?

2015-08-19 Thread Upayavira
You say all of your queries are based upon fq? Why? How unique are they?
Remember, each fq entry can end up storing one bit per
document in your index. If you have 8M documents, that is roughly
1 MB of cache for that single filter alone!

Filter queries are primarily designed for queries that are repeated,
e.g.: category:sport, where caching gives some advantage.

If all of your queries are unique, then move them to the q= parameter,
or make them fq={!cache=false}, otherwise you will waste memory storing
cached values that are never used, and CPU building and then destroying
those cached entries.
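For illustration, here is roughly what the two forms look like on a request,
using a hypothetical field name:

    Cached (default) - the filter's result set is stored in the filterCache:
      q=*:*&fq=DateDep:20150901

    Uncached - still restricts the results, but nothing is kept in the filterCache:
      q=*:*&fq={!cache=false}DateDep:20150901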

Upayavira

On Wed, Aug 19, 2015, at 02:25 PM, wwang525 wrote:
 Hi Erick,

 All my queries are based on fq (filter queries). I have to send randomly
 generated queries to warm up the low-level Lucene caches.

 I took the more tedious route of warming up the low-level caches without
 the three Solr caches by turning them off (setting their sizes to zero).
 Then I sent 800 randomly generated requests to Solr. RAM usage jumped
 from 500 MB to 2.5 GB and stayed there.

 Then I tested individual queries against Solr. This time, the response
 times were very close whether I issued a request the first, second, or
 third time.

 The results:

 (1) Average response time: 803 ms, with only one request having a
 response time over 1 second (1042 ms).
 (2) The majority of the time was spent on the query, not on faceting
 (730/803 ≈ 90%).

 So the query is the bottleneck.

 I also have an interesting finding: it looks like fq works better with
 integer-typed fields. I had created string-typed fields for two
 properties, DateDep and Duration, because setting docValues=true on the
 integer type did not work with my faceted search. At one point I
 accidentally ran a filter query against one of the string-typed
 properties and found that query performance degraded quite a lot.

 Is it generally true that fq works better with integer types?

 If this is the case, I could create two additional integer-typed
 properties to use for those two fq clauses and check whether that
 boosts performance.

 Thanks
 
 
 


Re: Is it a good query performance with this data size ?

2015-08-19 Thread Upayavira
Yes, you can limit the size of the filter cache, as Erick says, but
then, you could just end up with cache churn, where you are constantly
re-populating your cache as stuff gets pushed out, only to have to
regenerate it again for the next query.

Is it possible to decompose these queries into parts?

fq=+category:sport +year:2015

could be better expressed as:
fq=category:sport
fq=year:2015

Instead of resulting in cardinality(category) * cardinality(year) cache
entries, you'd have cardinality(category) + cardinality(year).

cardinality() here simply means the number of unique values for that
field.
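To put hypothetical numbers on that: with 20 categories and 10 years,
combined filters could eventually create up to 20 * 10 = 200 distinct
filterCache entries, while the decomposed form tops out at 20 + 10 = 30.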

Upayavira

On Wed, Aug 19, 2015, at 05:23 PM, Erick Erickson wrote:
 bq:  can I limit the size of the three
 caches so that the RAM usage will be under control
 
 That's exactly what the size parameter is for.
 
 As Upayavira says, the rough size of each entry in
 the filterCache is maxDocs/8 + (sizeof query string).
 
 queryResultCache entries are much smaller; each is roughly
 (sizeof entire query) + ((sizeof Java int) * queryResultWindowSize).

 queryResultWindowSize comes from solrconfig.xml. The point
 here is that this is rarely very big unless you make the
 queryResultCache huge.

 As for the documentCache, it's also usually not
 very large; it's (the size you declare) * (average size of a doc).
 
 Best,
 Erick
 
 On Wed, Aug 19, 2015 at 9:12 AM, wwang525 wwang...@gmail.com wrote:
  Hi Upayavira,

  Thank you very much for pointing out the potential design issue.

  The queries will be determined through a configuration maintained by business users.
  There will be a limited number of queries every day, and they will be executed by
  customers repeatedly. However, business users will change the configuration, so new
  queries will be generated, though they will also be limited in number. The change
  can be as frequent as daily or weekly. The project is to support daily promotions
  based on fresh index data.

  Cumulatively, there can be a lot of different queries. If I still want to
  take advantage of the filterCache, can I limit the size of the three
  caches so that the RAM usage stays under control?

  Thanks
 
 
 


Re: Is it a good query performance with this data size ?

2015-08-19 Thread wwang525
Hi Upayavira,

I happened to compose an individual fq for each field, such as:
fq=Gatewaycode:(...)&fq=DestCode:(...)&fq=DateDep:(...)&fq=Duration:(...)

It is nice to know that I am not creating unnecessary cache entries, since
the above method keeps the cardinality minimal, as you pointed out.

Thanks







Re: Is it a good query performance with this data size ?

2015-08-18 Thread Erick Erickson
Lot of stuff here, let me reply to a few things:

If you're faceting on high-cardinality fields, this is expensive.
How many unique values are there in the fields you facet on?
Note, I am _not_ asking about how many values are in the fields
of the selected set, but rather how many values corpus-wide.

The decreasing response times you're seeing are entirely
expected. Besides the caches in solrconfig.xml, the lower-level
Lucene caches must be filled from disk. So the first few queries
will be slower. Usually, to get a true picture of the performance,
I'll throw away the first minute or two of a performance test. This is
fair as usually autowarming can be used to keep this perf spike
from affecting users.

DocValues are performing as I'd expect. Normally, without DV
on a field, faceting etc. require that the internal inverted structure
be un-inverted. DV fields essentially serialize this un-inverted
field to disk making building it merely a matter of reading a bunch
of contiguous memory from disk. That said, once the internal
structure is built, the performance difference between DV and not
DV should be negligible.
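For reference, a minimal schema.xml sketch of docValues-enabled facet fields,
using the facet field names from this thread (hotel_code, departure_date);
the actual field types in the index may differ:

    <!-- hypothetical schema fragment: facet fields backed by docValues -->
    <field name="hotel_code"     type="string" indexed="true" stored="true" docValues="true"/>
    <field name="departure_date" type="string" indexed="true" stored="true" docValues="true"/>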

At the index size you're using, I wouldn't expect sharding to help
much if at all. There might even be a small penalty if you shard.
Try adding debug=timing to the query. That'll show you the
time spent in each component. NOTE: this is exclusive of the time
spent assembling the return docs (decompressing from disk,
transmitting back to the client etc). Speaking of which, if you're
returning a bunch of rows your response may be dominated by
assembling the return packet rather than scoring the docs.
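For example, a request along these lines (hypothetical host, core, and
parameter values) returns a per-component breakdown under the debug section
of the response:

    http://localhost:8983/solr/core1/select?q=*:*&fq=DestCode:(PUJ)&facet=true&facet.field=hotel_code&rows=10&debug=timing

Each search component (query, facet, highlight, debug, ...) then reports its
prepare and process time in milliseconds.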

Executing the same query twice is totally misleading. You're not
searching at all, but rather getting the docs from the queryResultCache
(probably). You _are_ faceting though.

The autowarm settings don't do you any good if you don't commit, i.e.
if you're not indexing. They're vitally important when you _do_ index
as you query. The firstSearcher and newSearcher events are
lists of queries that are fired when you first start Solr (and there's
nothing to autowarm) and when you commit, respectively. You might
put together queries that search, facet, sort etc. to smooth out your
initial response times.
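A sketch of such warming queries as firstSearcher/newSearcher listeners in
solrconfig.xml; the queries shown are placeholders and should be replaced
with ones that facet and sort on the fields the application really uses:

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="facet">true</str>
          <str name="facet.field">hotel_code</str>
          <str name="sort">departure_date asc</str>
        </lst>
      </arr>
    </listener>
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">*:*</str><str name="rows">10</str></lst>
      </arr>
    </listener>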

You're right to be suspicious of randomly generated queries. On the
plus side, this is usually a worst-case scenario. Getting real user
queries is always best although I understand it may not be possible;
sometimes you just have to guess unfortunately.

I'd look hard at the faceting. From what you're saying, that's dominating
your response time. I'd be interested in seeing the results of adding
debug=timing. My bet is that faceting is taking the most time.

And, if your generated queries are all matching all the docs in the
corpus, your times are artificially high. Again, I'd expect better response
time from a corpus this size, but as always your mileage may vary.

Best,
Erick


On Tue, Aug 18, 2015 at 8:54 AM, wwang525 wwang...@gmail.com wrote:
 Hi All,

 I am working on a search service based on Solr (v5.1.0). The data size is 15M
 records. The size of the index files is 860 MB. The test was performed on a
 local machine that has 8 cores with 32 GB of memory and a 3.4 GHz CPU (Intel
 Core i7-3770).

 I found that setting docValues=true for faceting and grouping indeed
 boosted the performance of a first-time search under a cold-cache scenario.
 For example, with our requests that use all the features (grouping,
 sorting, faceting), I found the difference for faceting alone can be as much
 as 300 ms.

 However, the response time for the same request executed a second time seems
 to be at the same level whether docValues is true or false.
 Still, I set docValues=true for all the faceting properties.

 The following are what I have observed:

 (1) Test single request one-by-one (no load)

 With a cold cache, I execute randomly generated queries one after another.
 The first query routinely exceeds 1 second, but not usually more than 2
 seconds. As I continue to generate random requests and execute the queries
 one by one, the response time normally stabilizes around 500 ms. It
 does not seem to improve further as I continue executing randomly generated
 queries.

 (2) Load test with randomly generated requests

 Under a load-test scenario (each core takes 4 requests per second,
 continuing for 20 rounds), I can see the CPU usage jump, and the earlier
 requests usually get much longer response times; they may even exceed 5
 seconds. However, the CPU usage pattern then changes to a sawtooth shape,
 the response time drops, and I can see that the requests get executed
 faster and faster. I usually get an average response time of around 1
 second.

 If I run the load test again, the average response time continues to
 drop. However, it stays at about 500 ms per request under this load if I
 run more tests.

 These are the best results so far.

 I understand that the requests were 

Re: Is it a good query performance with this data size ?

2015-08-18 Thread wwang525
Hi Erick,

Two facets are probably demanding:

departure_date has 365 distinct values and hotel_code can have 800 distinct
values.

The docValues setting definitely helped me a lot even when all the queries
had the above two facets. I will test a list of queries with or without the
two facets after indexing the data (to take advantage of cache warming).

Thanks





Re: Is it a good query performance with this data size ?

2015-08-18 Thread Erick Erickson
Those are not that high. I was thinking of facets with thousands to
tens-of-thousands of unique values. I really wouldn't expect this to
be a huge hit unless you're querying all docs.

Let us know what you find.

Best,
Erick

On Tue, Aug 18, 2015 at 11:31 AM, wwang525 wwang...@gmail.com wrote:
 Hi Erick,

 Two facets are probably demanding:

 departure_date has 365 distinct values and hotel_code can have 800 distinct
 values.

 The docValues setting definitely helped me a lot even when all the queries
 had the above two facets. I will test a list of queries with or without the
 two facets after indexing the data (to take advantage of cache warming).

 Thanks





Re: Is it a good query performance with this data size ?

2015-08-18 Thread Erick Erickson
bq: can I turn off the three cache and send a lot of queries to Solr

I really think you're missing the easiest way to do that.
To not put anything in the filter cache, just don't send any fq clauses.

As far as the doc cache is concerned, by and large I just wouldn't
worry about it. With MMapDirectory, it's less valuable than it was
when it was created. Its primary use is that the components in a
single query don't have to re-read the docs from disk. As for
the queryResultCache, by not putting fq clauses on the warmup
queries you won't hit this cache next time around.
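In other words, a warmup run could move the filter criteria into q so that
nothing lands in the filterCache. A sketch, reusing the field names from
earlier in the thread with made-up values:

    Instead of:
      q=*:*&fq=Gatewaycode:YUL&fq=DestCode:PUJ&fq=DateDep:20150901&fq=Duration:7

    warm up with:
      q=+Gatewaycode:YUL +DestCode:PUJ +DateDep:20150901 +Duration:7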

Best,
Erick

On Tue, Aug 18, 2015 at 1:17 PM, wwang525 wwang...@gmail.com wrote:
 Hi Erick,

 I just tested 10 different queries, with and without faceting on
 the two properties departure_date and hotel_code. Under a cold-cache
 scenario, they have pretty much the same response time, and the faceting
 took much less time than the query. Under a cold-cache scenario, the
 query (in the timing output) is still the bottleneck.

 I understand that the low-level cache needs to be warmed up to do a more
 realistic test. However, I do not have a good and consistent way to warm up
 the low-level cache without caching the filter queries at the same time. If
 I load test some random queries before I test these 10 individual queries, I
 can see a better response time in some cases, but that could also be due to
 the filter cache.

 To load the low-level Lucene caches without creating filterCache/documentCache
 entries etc., can I turn off the three caches and send a lot of queries to Solr
 before I start testing the performance of each individual query?

 Thanks








Re: Is it a good query performance with this data size ?

2015-08-18 Thread wwang525
Hi Erick,

I just tested 10 different queries, with and without faceting on
the two properties departure_date and hotel_code. Under a cold-cache
scenario, they have pretty much the same response time, and the faceting
took much less time than the query. Under a cold-cache scenario, the
query (in the timing output) is still the bottleneck.

I understand that the low-level cache needs to be warmed up to do a more
realistic test. However, I do not have a good and consistent way to warm up
the low-level cache without caching the filter queries at the same time. If
I load test some random queries before I test these 10 individual queries, I
can see a better response time in some cases, but that could also be due to
the filter cache.

To load the low-level Lucene caches without creating filterCache/documentCache
entries etc., can I turn off the three caches and send a lot of queries to Solr
before I start testing the performance of each individual query?

Thanks





