Re: SolrCloud setup - any advice?

2013-09-27 Thread Neil Prosser
Good point. I'd seen docValues and wondered whether they might be of use in
this situation. However, as I understand it they require a value to be set
for all documents until Solr 4.5. Is that true or was I imagining reading
that?


Re: SolrCloud setup - any advice?

2013-09-27 Thread Erick Erickson
I think you're right, but you can specify a default value in your schema.xml
to at least see if this is a good path to follow.

Best,
Erick
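
For reference, here is a minimal schema.xml sketch of what Erick describes for a
pre-4.5 index: a docValues field given a default so that every document ends up
with a value. The type and field names are invented for illustration.

  <fieldType name="int_dv" class="solr.TrieIntField" precisionStep="0"
             positionIncrementGap="0"/>

  <!-- default="0" fills in documents that would otherwise have no value,
       which docValues fields require before Solr 4.5 -->
  <field name="popularity_rank" type="int_dv" indexed="true" stored="true"
         docValues="true" default="0"/>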

Re: SolrCloud setup - any advice?

2013-09-25 Thread Erick Erickson
Hmmm, I confess I haven't had a chance to play with this yet,
but have you considered docValues for some of your fields? See:
http://wiki.apache.org/solr/DocValues

And just to tantalize you:

 Since Solr 4.2: to build a forward index for a field, for purposes of sorting,
 faceting, grouping, function queries, etc.

 You can specify a different docValuesFormat on the fieldType
 (docValuesFormat="Disk") to only load minimal data on the heap, keeping
 other data structures on disk.

Do note, though:
 Not a huge improvement for a static index

This latter isn't a problem, though, since you don't have a static index.

Erick
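
To make the wiki quote above concrete, a hypothetical schema.xml fragment using
the on-disk format might look like the following. The field and type names are
invented, and per-field docValuesFormat also needs the schema codec factory
enabled in solrconfig.xml.

  <!-- in solrconfig.xml, so the per-field docValuesFormat below is honoured -->
  <codecFactory class="solr.SchemaCodecFactory"/>

  <!-- in schema.xml: keep only minimal docValues data on the heap -->
  <fieldType name="string_dv_disk" class="solr.StrField"
             docValuesFormat="Disk" sortMissingLast="true"/>

  <field name="territory_rank_gb" type="string_dv_disk"
         indexed="true" stored="false" docValues="true"/>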

Re: SolrCloud setup - any advice?

2013-09-24 Thread Neil Prosser
Shawn: unfortunately the current problems are with facet.method=enum!

Erick: We already round our date queries so they're the same for at least
an hour, so thankfully our fq entries will be reusable. However, I'll take a
look at reducing the cache and autowarming counts and see what the effect
on hit ratios and performance is.
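
Purely as an illustration of the rounding described here (the field name and
handler below are invented, and the rounding would normally be applied on the
client side when the query is built), a date filter rounded with Solr date math
stays byte-identical for a whole hour, so its filterCache entry can be reused:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="appends">
      <!-- NOW/HOUR rounds down to the top of the hour, so this fq string
           repeats unchanged for an hour instead of varying every request -->
      <str name="fq">published_date:[NOW/HOUR-7DAYS TO NOW/HOUR]</str>
    </lst>
  </requestHandler>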

For SolrCloud our soft commit interval is 15 seconds and our hard commit
(openSearcher=false) interval is 15 minutes.

You're right about those sorted fields having a lot of unique values. They
can be any number between 0 and 10,000,000 (it's sparsely populated across
the documents) and could appear in several variants across multiple
documents. This is probably a good area for seeing what we can bend with
regard to our requirements for sorting/boosting. I've just looked at two
shards and they've each got upwards of 1000 terms showing in the schema
browser for one of the (potentially 60) fields.



Re: SolrCloud setup - any advice?

2013-09-21 Thread Erick Erickson
About caches. The queryResultCache is only useful when you expect there
to be a number of _identical_ queries. Think of this cache as a map where
the key is the query and the value is just a list of N document IDs (internal)
where N is your window size. Paging is often the place where this is used.
Take a look at your admin page for this cache, you can see the hit rates.
But the take-away is that this is a very small cache memory-wise; varying
it is probably not a great predictor of memory usage.

The filterCache is more intense memory-wise: it's another map where the
key is the fq clause and the value is bounded by maxDoc/8 bytes. Take a
close look at this in the admin screen and see what the hit ratio is. It may
be that you can make it much smaller and still get a lot of benefit,
_especially_ considering it could occupy about 44G of memory:
(43,000,000 / 8) bytes per entry, roughly 5.4MB, times 8192 entries.
And the autowarm count is excessive in
most cases from what I've seen. Cutting the autowarm down to, say, 16
may not make a noticeable difference in your response time. And if
you're using NOW in your fq clauses, it's almost totally useless, see:
http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
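
As a sketch only (the exact numbers are placeholders to experiment with, not
values recommended in the thread), a much smaller filterCache with the reduced
autowarm count Erick mentions would look like this in solrconfig.xml:

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="16"/>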

Also, read Uwe's excellent blog about MMapDirectory here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
for some problems with over-allocating memory to the JVM. Of course
if you're hitting OOMs, well.

bq: order them by one of their fields.
This is one place I'd look first. How many unique values are in each field
that you sort on? This is one of the major memory consumers. You can
get a sense of this by looking at admin/schema-browser and selecting
the fields you sort on. There's a text box with the number of terms returned,
then a / ### where ### is the total count of unique terms in the field. NOTE:
in 4.4 this will be -1 for multiValued fields, but you shouldn't be sorting on
those anyway. How many fields are you sorting on anyway, and of what types?

For your SolrCloud experiments, what are your soft and hard commit intervals?
Because something is really screwy here: sharding that brings the
number of docs per shard down this low should be fast. Back to the point
above, the only good explanation I can come up with at this remove is
that the fields you sort on have a LOT of unique values. It's possible that
the total number of unique values isn't scaling with sharding. That is, each
shard may have, say, 90% of all unique terms (number from thin air). Worth
checking anyway, but a stretch.

This is definitely unusual...

Best,
Erick


Re: SolrCloud setup - any advice?

2013-09-20 Thread Neil Prosser
Sorry, my bad. For SolrCloud soft commits are enabled (every 15 seconds). I
do a hard commit from an external cron task via curl every 15 minutes.
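
A minimal solrconfig.xml sketch of those intervals, assuming the stock update
handler (the hard commits here actually come from the external cron/curl job
rather than autoCommit):

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- soft commit every 15 seconds -->
    <autoSoftCommit>
      <maxTime>15000</maxTime>
    </autoSoftCommit>
    <!-- the same 15-minute hard-commit cadence, if it were moved into Solr;
         openSearcher=false keeps hard commits from opening a new searcher -->
    <autoCommit>
      <maxTime>900000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>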

The version I'm using for the SolrCloud setup is 4.4.0.

Document cache warm-up times are 0ms.
Filter cache warm-up times are between 3 and 7 seconds.
Query result cache warm-up times are between 0 and 2 seconds.

I haven't tried disabling the caches, I'll give that a try and see what
happens.

This isn't a static index. We are indexing documents into it. We're keeping
up with our normal update load, which is to make updates to a percentage of
the documents (thousands, not hundreds).




Re: SolrCloud setup - any advice?

2013-09-20 Thread Shawn Heisey
On 9/19/2013 9:20 AM, Neil Prosser wrote:
 Apologies for the giant email. Hopefully it makes sense.

Because of its size, I'm going to reply inline like this and I'm going
to trim out portions of your original message.  I hope that's not
horribly confusing to you!  Looking through my archive of the mailing
list, I see that I have given you some of this information before.

 Our index size ranges between 144GB and 200GB (when we optimise it back
 down, since we've had bad experiences with large cores). We've got just
 over 37M documents some are smallish but most range between 1000-6000
 bytes. We regularly update documents so large portions of the index will be
 touched leading to a maxDocs value of around 43M.
 
 Query load ranges between 400req/s to 800req/s across the five slaves
 throughout the day, increasing and decreasing gradually over a period of
 hours, rather than bursting.

With indexes of that size and 96GB of RAM, you're starting to get into
the size range where severe performance problems begin happening.  Also,
with no GC tuning other than turning on CMS (and a HUGE 48GB heap on top
of that), you're going to run into extremely long GC pause times.  Your
query load is what I would call quite high, which will make those GC
problems quite frequent.

This is the problem I was running into with only an 8GB heap, with
similar tuning where I just turned on CMS.  When Solr disappears for 10+
seconds at a time for garbage collection, the load balancer will
temporarily drop that server from the available pool.

I'm aware that this is your old setup, so we'll put it aside for now  so
we can concentrate on your SolrCloud setup.

 Most of our documents have upwards of twenty fields. We use different
 fields to store territory variant (we have around 30 territories) values
 and also boost based on the values in some of these fields (integer ones).
 
 So an average query can do a range filter by two of the territory variant
 fields, filter by a non-territory variant field. Facet by a field or two
 (may be territory variant). Bring back the values of 60 fields. Boost query
 on field values of a non-territory variant field. Boost by values of two
 territory-variant fields. Dismax query on up to 20 fields (with boosts) and
 phrase boost on those fields too. They're pretty big queries. We don't do
 any index-time boosting. We try to keep things dynamic so we can alter our
 boosts on-the-fly.

The nature of your main queries (and possibly your filters) is probably
always going to be a little memory hungry, but it sounds like the facets
are probably what's requiring such incredible amounts of heap RAM.

Try putting a facet.method parameter into your request handler defaults
and set it to enum.  The default is fc which means fieldcache - it
basically loads all the indexed terms for that field on the entire index
into the field cache.  Multiply that by the number of fields that you
facet on (across all your queries), and it can be a real problem.
Memory is always going to be required for quick facets, but it's
generally better to let the OS handle it automatically with disk caching
than to load it into the java heap.
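
A sketch of the change Shawn suggests, assuming a standard search handler (the
handler name below is illustrative); putting the parameter in the defaults
applies it to every facet unless a request overrides it:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="facet.method">enum</str>
    </lst>
  </requestHandler>

Individual fields can still override this per request with
f.fieldname.facet.method.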

Your next paragraph (which I trimmed) talks about sorting, which is
another thing that eats up java heap.  The amount taken is based on the
number of documents in the index, and a chunk is taken for every field
that you use for sorting.  See if you can reduce the number of fields
you use for sorting.

 Even when running on six machines in AWS with SSDs, 24GB heap (out of 60GB
 memory) and four shards on two boxes and three on the rest I still see
 concurrent mode failure. This looks like it's causing ZooKeeper to mark the
 node as down and things begin to struggle.
 
 Is concurrent mode failure just something that will inevitably happen or is
 it avoidable by dropping the CMSInitiatingOccupancyFraction?

I assume that concurrent mode failure is what gets logged preceding a
full garbage collection.  Aggressively tuning your GC will help
immensely.  The link below has what I am currently using.  Someone on
IRC was saying that they have a 48GB heap with similar settings and they
never see huge pauses.  These tuning parameters don't use fixed memory
sizes, so it should work with any size max heap:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Otis has mentioned G1.  What I found when I used G1 was that it worked
extremely well *almost* all of the time.  The occasions for full garbage
collections were a LOT less frequent, but when they happened, the pause
was *even longer* than the untuned CMS.  That caused big problems for me
and my load balancer.  Until someone can come up with some awesome G1
tuning parameters, I personally will continue to avoid it except for
small-heap applications.  G1 is an awesome idea.  If it can be tuned, it
will probably be better than a tuned CMS.

Switching to facet.method=enum as outlined above will probably do the
most for letting you decrease your max java heap.  

SolrCloud setup - any advice?

2013-09-19 Thread Neil Prosser
Apologies for the giant email. Hopefully it makes sense.

We've been trying out SolrCloud to solve some scalability issues with our
current setup and have run into problems. I'd like to describe our current
setup, our queries and the sort of load we see and am hoping someone might
be able to spot the massive flaw in the way I've been trying to set things
up.

We currently run Solr 4.0.0 in the old style Master/Slave replication. We
have five slaves, each running Centos with 96GB of RAM, 24 cores and with
48GB assigned to the JVM heap. Disks aren't crazy fast (i.e. not SSDs) but
aren't slow either. Our GC parameters aren't particularly exciting, just
-XX:+UseConcMarkSweepGC. Java version is 1.7.0_11.

Our index size ranges between 144GB and 200GB (when we optimise it back
down, since we've had bad experiences with large cores). We've got just
over 37M documents; some are smallish but most range between 1000-6000
bytes. We regularly update documents, so large portions of the index will be
touched, leading to a maxDocs value of around 43M.

Query load ranges from 400req/s to 800req/s across the five slaves
throughout the day, increasing and decreasing gradually over a period of
hours, rather than bursting.

Most of our documents have upwards of twenty fields. We use different
fields to store territory-variant values (we have around 30 territories)
and also boost based on the values in some of these fields (integer ones).

So an average query can do a range filter by two of the territory variant
fields, filter by a non-territory variant field. Facet by a field or two
(may be territory variant). Bring back the values of 60 fields. Boost query
on field values of a non-territory variant field. Boost by values of two
territory-variant fields. Dismax query on up to 20 fields (with boosts) and
phrase boost on those fields too. They're pretty big queries. We don't do
any index-time boosting. We try to keep things dynamic so we can alter our
boosts on-the-fly.

Another common query is to list documents with a given set of IDs and
select documents with a common reference and order them by one of their
fields.

Auto-commit every 30 minutes. Replication polls every 30 minutes.

Document cache:
  * initialSize - 32768
  * size - 32768

Filter cache:
  * autowarmCount - 128
  * initialSize - 8192
  * size - 8192

Query result cache:
  * autowarmCount - 128
  * initialSize - 8192
  * size - 8192
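
For reference, those settings correspond to solrconfig.xml cache definitions
along these lines (the cache classes are assumed; only the sizes and autowarm
counts come from the message):

  <documentCache class="solr.LRUCache"
                 size="32768" initialSize="32768"/>

  <filterCache class="solr.FastLRUCache"
               size="8192" initialSize="8192" autowarmCount="128"/>

  <queryResultCache class="solr.LRUCache"
                    size="8192" initialSize="8192" autowarmCount="128"/>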

After a replicated core has finished downloading (probably while it's
warming) we see requests which usually take around 100ms taking over 5s. GC
logs show concurrent mode failure.

I was wondering whether anyone can help with sizing the boxes required to
split this index down into shards for use with SolrCloud and roughly how
much memory we should be assigning to the JVM. Everything I've read
suggests that running with a 48GB heap is way too high but every attempt
I've made to reduce the cache sizes seems to wind up causing out-of-memory
problems. Even dropping all cache sizes by 50% and reducing the heap by 50%
caused problems.

I've already tried SolrCloud with 10 shards (around 3.7M documents per
shard, each with one replica) and kept the cache sizes low:

Document cache:
  * initialSize - 1024
  * size - 1024

Filter cache:
  * autowarmCount - 128
  * initialSize - 512
  * size - 512

Query result cache:
  * autowarmCount - 32
  * initialSize - 128
  * size - 128

Even when running on six machines in AWS with SSDs, 24GB heap (out of 60GB
memory), and four shards on two boxes and three on the rest, I still see
concurrent mode failure. This looks like it's causing ZooKeeper to mark the
node as down and things begin to struggle.

Is concurrent mode failure just something that will inevitably happen or is
it avoidable by dropping the CMSInitiatingOccupancyFraction?

If anyone has anything that might shove me in the right direction I'd be
very grateful. I'm wondering whether our set-up will just never work and
maybe we're expecting too much.

Many thanks,

Neil


Re: SolrCloud setup - any advice?

2013-09-19 Thread Shreejay Nair
Hi Neil,

Although you haven't mentioned it, just wanted to confirm - do you have
soft commits enabled?

Also, what version of Solr are you using for the SolrCloud setup?
4.0.0 had lots of memory and ZooKeeper-related issues. What's the warmup time for
your caches? Have you tried disabling the caches?

Is this a static index, or are documents added continuously?

The answers to these questions might help us pinpoint the issue...


Re: SolrCloud setup - any advice?

2013-09-19 Thread Otis Gospodnetic
Hi Neil,

Consider using G1 instead.  See http://blog.sematext.com/?s=g1

If that doesn't help, we can play with various JVM parameters.  The latest
version of SPM for Solr exposes information about sizes and utilization of
JVM memory pools, which may help you understand which JVM params you need
to change, how, and whether your changes are achieving the desired effect.

Otis
Solr & ElasticSearch Support
http://sematext.com/

