DisMax search on a field only if it exists, otherwise fall back to another

2015-01-15 Thread Neil Prosser
Hopefully this question makes sense.

At the moment I'm using a DisMax query which looks something like the
following (massively cut-down):

?defType=dismax
&q=some query
&qf=field_one^0.5 field_two^1.0

I've got some localisation work coming up where I'd like to use the value
of one sparsely populated field if it exists, falling back to another if
it doesn't (rather than duplicating some default value for all territories).

Using the standard query parser I understand I can do the following to get
this behaviour:

?q=if(exists(field_one),(field_one:some query),(field_three:some query))

However, I don't know how I would go about using DisMax for this type of
fallback.

Has anyone tried to do this sort of thing before? Is there a way to use
functions within the qf parameter?
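
For illustration, the closest I've come to expressing that idea in a single
request (untested, field names invented, and I'm not sure exists() is happy
with anything other than a single-valued field) is to wrap two dismax queries
in a function query and choose between them per document:

?q={!func}if(exists(field_one),query($qa),query($qb))
&qa={!dismax qf=field_one}some query
&qb={!dismax qf=field_three}some query

A bare function query matches every document, though, so presumably it would
need to be combined with a filter or used as a boost rather than as the main
query.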


Re: SolrCloud setup - any advice?

2013-09-27 Thread Neil Prosser
Good point. I'd seen docValues and wondered whether they might be of use in
this situation. However, as I understand it they require a value to be set
for all documents until Solr 4.5. Is that true or was I imagining reading
that?
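
For reference, the kind of field I have in mind would look something like this
in schema.xml (names invented; the default attribute being the workaround I'd
apparently need pre-4.5 so that every document has a value):

<field name="price_gb" type="long" indexed="true" stored="true"
       docValues="true" default="0"/>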


On 25 September 2013 11:36, Erick Erickson  wrote:

> H, I confess I haven't had a chance to play with this yet,
> but have you considered docValues for some of your fields? See:
> http://wiki.apache.org/solr/DocValues
>
> And just to tantalize you:
>
> > Since Solr4.2 to build a forward index for a field, for purposes of
> sorting, faceting, grouping, function queries, etc.
>
> > You can specify a different docValuesFormat on the fieldType
> (docValuesFormat="Disk") to only load minimal data on the heap, keeping
> other data structures on disk.
>
> Do note, though:
> > Not a huge improvement for a static index
>
> this latter isn't a problem though since you don't have a static index
>
> Erick
>
> On Tue, Sep 24, 2013 at 4:13 AM, Neil Prosser 
> wrote:
> > Shawn: unfortunately the current problems are with facet.method=enum!
> >
> > Erick: We already round our date queries so they're the same for at least
> > an hour so thankfully our fq entries will be reusable. However, I'll
> take a
> > look at reducing the cache and autowarming counts and see what the effect
> > on hit ratios and performance is.
> >
> > For SolrCloud our soft commit interval is 15 seconds and our hard commit
> > (openSearcher=false) interval is 15 minutes.
> >
> > You're right about those sorted fields having a lot of unique values.
> They
> > can be any number between 0 and 10,000,000 (it's sparsely populated
> across
> > the documents) and could appear in several variants across multiple
> > documents. This is probably a good area for seeing what we can bend with
> > regard to our requirements for sorting/boosting. I've just looked at two
> > shards and they've each got upwards of 1000 terms showing in the schema
> > browser for one (potentially out of 60) fields.
> >
> >
> >
> > On 21 September 2013 20:07, Erick Erickson 
> wrote:
> >
> >> About caches. The queryResultCache is only useful when you expect there
> >> to be a number of _identical_ queries. Think of this cache as a map
> where
> >> the key is the query and the value is just a list of N document IDs
> >> (internal)
> >> where N is your window size. Paging is often the place where this is
> used.
> >> Take a look at your admin page for this cache, you can see the hit
> rates.
> >> But, the take-away is that this is a very small cache memory-wise,
> varying
> >> it is probably not a great predictor of memory usage.
> >>
> >> The filterCache is more intense memory wise, it's another map where the
> >> key is the fq clause and the value is bounded by maxDoc/8. Take a
> >> close look at this in the admin screen and see what the hit ratio is. It
> >> may
> >> be that you can make it much smaller and still get a lot of benefit.
> >> _Especially_ considering it could occupy about 44G of memory.
> >> (43,000,000 / 8) * 8192. And the autowarm count is excessive in
> >> most cases from what I've seen. Cutting the autowarm down to, say, 16
> >> may not make a noticeable difference in your response time. And if
> >> you're using NOW in your fq clauses, it's almost totally useless, see:
> >> http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
> >>
> >> Also, read Uwe's excellent blog about MMapDirectory here:
> >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >> for some problems with over-allocating memory to the JVM. Of course
> >> if you're hitting OOMs, well.
> >>
> >> bq: order them by one of their fields.
> >> This is one place I'd look first. How many unique values are in each
> field
> >> that you sort on? This is one of the major memory consumers. You can
> >> get a sense of this by looking at admin/schema-browser and selecting
> >> the fields you sort on. There's a text box with the number of terms
> >> returned,
> >> then a / ### where ### is the total count of unique terms in the field.
> >> NOTE:
> >> in 4.4 this will be -1 for multiValued fields, but you shouldn't be
> >> sorting on
> >> those anyway. How many fields are you sorting on anyway, and of what
> types?
> >>
> >> For your SolrCloud experiments, 

Re: SolrCloud setup - any advice?

2013-09-24 Thread Neil Prosser
Shawn: unfortunately the current problems are with facet.method=enum!

Erick: We already round our date queries so they're the same for at least
an hour so thankfully our fq entries will be reusable. However, I'll take a
look at reducing the cache and autowarming counts and see what the effect
on hit ratios and performance is.

For SolrCloud our soft commit interval is 15 seconds and our hard commit
(openSearcher=false) interval is 15 minutes.

You're right about those sorted fields having a lot of unique values. They
can be any number between 0 and 10,000,000 (it's sparsely populated across
the documents) and could appear in several variants across multiple
documents. This is probably a good area for seeing what we can bend with
regard to our requirements for sorting/boosting. I've just looked at two
shards and they've each got upwards of 1000 terms showing in the schema
browser for one (potentially out of 60) fields.



On 21 September 2013 20:07, Erick Erickson  wrote:

> About caches. The queryResultCache is only useful when you expect there
> to be a number of _identical_ queries. Think of this cache as a map where
> the key is the query and the value is just a list of N document IDs
> (internal)
> where N is your window size. Paging is often the place where this is used.
> Take a look at your admin page for this cache, you can see the hit rates.
> But, the take-away is that this is a very small cache memory-wise, varying
> it is probably not a great predictor of memory usage.
>
> The filterCache is more intense memory wise, it's another map where the
> key is the fq clause and the value is bounded by maxDoc/8. Take a
> close look at this in the admin screen and see what the hit ratio is. It
> may
> be that you can make it much smaller and still get a lot of benefit.
> _Especially_ considering it could occupy about 44G of memory.
> (43,000,000 / 8) * 8192. And the autowarm count is excessive in
> most cases from what I've seen. Cutting the autowarm down to, say, 16
> may not make a noticeable difference in your response time. And if
> you're using NOW in your fq clauses, it's almost totally useless, see:
> http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
>
> Also, read Uwe's excellent blog about MMapDirectory here:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> for some problems with over-allocating memory to the JVM. Of course
> if you're hitting OOMs, well.
>
> bq: order them by one of their fields.
> This is one place I'd look first. How many unique values are in each field
> that you sort on? This is one of the major memory consumers. You can
> get a sense of this by looking at admin/schema-browser and selecting
> the fields you sort on. There's a text box with the number of terms
> returned,
> then a / ### where ### is the total count of unique terms in the field.
> NOTE:
> in 4.4 this will be -1 for multiValued fields, but you shouldn't be
> sorting on
> those anyway. How many fields are you sorting on anyway, and of what types?
>
> For your SolrCloud experiments, what are your soft and hard commit
> intervals?
> Because something is really screwy here. Your sharding, moving the
> number of docs down this low per shard, should be fast. Back to the point
> above, the only good explanation I can come up with from this remove is
> that the fields you sort on have a LOT of unique values. It's possible that
> the total number of unique values isn't scaling with sharding. That is,
> each
> shard may have, say, 90% of all unique terms (number from thin air). Worth
> checking anyway, but a stretch.
>
> This is definitely unusual...
>
> Best,
> Erick
>
>
> On Thu, Sep 19, 2013 at 8:20 AM, Neil Prosser 
> wrote:
> > Apologies for the giant email. Hopefully it makes sense.
> >
> > We've been trying out SolrCloud to solve some scalability issues with our
> > current setup and have run into problems. I'd like to describe our
> current
> > setup, our queries and the sort of load we see and am hoping someone
> might
> > be able to spot the massive flaw in the way I've been trying to set
> things
> > up.
> >
> > We currently run Solr 4.0.0 in the old style Master/Slave replication. We
> > have five slaves, each running Centos with 96GB of RAM, 24 cores and with
> > 48GB assigned to the JVM heap. Disks aren't crazy fast (i.e. not SSDs)
> but
> > aren't slow either. Our GC parameters aren't particularly exciting, just
> > -XX:+UseConcMarkSweepGC. Java version is 1.7.0_11.
> >
> > Our index size ranges between 144GB and 200GB (when we optimise it back
> > down, since we've ha

Re: SolrCloud setup - any advice?

2013-09-20 Thread Neil Prosser
Sorry, my bad. For SolrCloud soft commits are enabled (every 15 seconds). I
do a hard commit from an external cron task via curl every 15 minutes.
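
For reference, the cron task is nothing more exotic than a curl against the
update handler, along these lines (host and core name made up):

curl 'http://localhost:8983/solr/collection1/update?commit=true'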

The version I'm using for the SolrCloud setup is 4.4.0.

Document cache warm-up times are 0ms.
Filter cache warm-up times are between 3 and 7 seconds.
Query result cache warm-up times are between 0 and 2 seconds.

I haven't tried disabling the caches, I'll give that a try and see what
happens.

This isn't a static index. We are indexing documents into it. We're keeping
up with our normal update load, which is to make updates to a percentage of
the documents (thousands, not hundreds).




On 19 September 2013 20:33, Shreejay Nair  wrote:

> Hi Neil,
>
> Although you haven't mentioned it, just wanted to confirm - do you have
> soft commits enabled?
>
> Also what's the version of solr you are using for the solr cloud setup?
> 4.0.0 had lots of memory and zk related issues. What's the warmup time for
> your caches? Have you tried disabling the caches?
>
> Is this a static index or are documents added continuously?
>
> The answers to these questions might help us pinpoint the issue...
>
> On Thursday, September 19, 2013, Neil Prosser wrote:
>
> > Apologies for the giant email. Hopefully it makes sense.
> >
> > We've been trying out SolrCloud to solve some scalability issues with our
> > current setup and have run into problems. I'd like to describe our
> current
> > setup, our queries and the sort of load we see and am hoping someone
> might
> > be able to spot the massive flaw in the way I've been trying to set
> things
> > up.
> >
> > We currently run Solr 4.0.0 in the old style Master/Slave replication. We
> > have five slaves, each running Centos with 96GB of RAM, 24 cores and with
> > 48GB assigned to the JVM heap. Disks aren't crazy fast (i.e. not SSDs)
> but
> > aren't slow either. Our GC parameters aren't particularly exciting, just
> > -XX:+UseConcMarkSweepGC. Java version is 1.7.0_11.
> >
> > Our index size ranges between 144GB and 200GB (when we optimise it back
> > down, since we've had bad experiences with large cores). We've got just
> > over 37M documents; some are smallish but most range between 1000 and 6000
> > bytes. We regularly update documents so large portions of the index will
> be
> > touched leading to a maxDocs value of around 43M.
> >
> > Query load ranges between 400req/s to 800req/s across the five slaves
> > throughout the day, increasing and decreasing gradually over a period of
> > hours, rather than bursting.
> >
> > Most of our documents have upwards of twenty fields. We use different
> > fields to store territory variant (we have around 30 territories) values
> > and also boost based on the values in some of these fields (integer
> ones).
> >
> > So an average query can do a range filter by two of the territory variant
> > fields, filter by a non-territory variant field. Facet by a field or two
> > (may be territory variant). Bring back the values of 60 fields. Boost
> query
> > on field values of a non-territory variant field. Boost by values of two
> > territory-variant fields. Dismax query on up to 20 fields (with boosts)
> and
> > phrase boost on those fields too. They're pretty big queries. We don't do
> > any index-time boosting. We try to keep things dynamic so we can alter
> our
> > boosts on-the-fly.
> >
> > Another common query is to list documents with a given set of IDs and
> > select documents with a common reference and order them by one of their
> > fields.
> >
> > Auto-commit every 30 minutes. Replication polls every 30 minutes.
> >
> > Document cache:
> >   * initialSize - 32768
> >   * size - 32768
> >
> > Filter cache:
> >   * autowarmCount - 128
> >   * initialSize - 8192
> >   * size - 8192
> >
> > Query result cache:
> >   * autowarmCount - 128
> >   * initialSize - 8192
> >   * size - 8192
> >
> > After a replicated core has finished downloading (probably while it's
> > warming) we see requests which usually take around 100ms taking over 5s.
> GC
> > logs show concurrent mode failure.
> >
> > I was wondering whether anyone can help with sizing the boxes required to
> > split this index down into shards for use with SolrCloud and roughly how
> > much memory we should be assigning to the JVM. Everything I've read
> > suggests that running with a 48GB heap is way too high but every attempt
> > I've made to reduce the cache sizes seems to w

SolrCloud setup - any advice?

2013-09-19 Thread Neil Prosser
Apologies for the giant email. Hopefully it makes sense.

We've been trying out SolrCloud to solve some scalability issues with our
current setup and have run into problems. I'd like to describe our current
setup, our queries and the sort of load we see and am hoping someone might
be able to spot the massive flaw in the way I've been trying to set things
up.

We currently run Solr 4.0.0 in the old style Master/Slave replication. We
have five slaves, each running Centos with 96GB of RAM, 24 cores and with
48GB assigned to the JVM heap. Disks aren't crazy fast (i.e. not SSDs) but
aren't slow either. Our GC parameters aren't particularly exciting, just
-XX:+UseConcMarkSweepGC. Java version is 1.7.0_11.

Our index size ranges between 144GB and 200GB (when we optimise it back
down, since we've had bad experiences with large cores). We've got just
over 37M documents; some are smallish but most range between 1000 and 6000
bytes. We regularly update documents so large portions of the index will be
touched leading to a maxDocs value of around 43M.

Query load ranges between 400req/s to 800req/s across the five slaves
throughout the day, increasing and decreasing gradually over a period of
hours, rather than bursting.

Most of our documents have upwards of twenty fields. We use different
fields to store territory variant (we have around 30 territories) values
and also boost based on the values in some of these fields (integer ones).

So an average query can do a range filter by two of the territory variant
fields, filter by a non-territory variant field. Facet by a field or two
(may be territory variant). Bring back the values of 60 fields. Boost query
on field values of a non-territory variant field. Boost by values of two
territory-variant fields. Dismax query on up to 20 fields (with boosts) and
phrase boost on those fields too. They're pretty big queries. We don't do
any index-time boosting. We try to keep things dynamic so we can alter our
boosts on-the-fly.
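
To give a flavour of the shape, a cut-down version of such a request looks
something like the following (field names and boosts invented, and the real
qf/fl lists are much longer):

?defType=dismax
&q=some query
&qf=name_gb^2.0 description_gb^0.5 brand_gb^1.5
&pf=name_gb^4.0 description_gb^1.0
&bq=in_stock:true^2.0
&bf=popularity_gb^0.1 rating_gb^0.05
&fq=price_gb:[10 TO 100]
&fq=release_date_gb:[NOW/HOUR-30DAYS TO NOW/HOUR]
&fq=category:shoes
&facet=true
&facet.field=brand_gb
&fl=id,name_gb,price_gb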

Another common query is to list documents with a given set of IDs and
select documents with a common reference and order them by one of their
fields.
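
Again with invented names, that one is roughly:

?q=id:(1001 OR 1002 OR 1003)
&fq=reference:ABC123
&sort=position_gb asc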

Auto-commit every 30 minutes. Replication polls every 30 minutes.

Document cache:
  * initialSize - 32768
  * size - 32768

Filter cache:
  * autowarmCount - 128
  * initialSize - 8192
  * size - 8192

Query result cache:
  * autowarmCount - 128
  * initialSize - 8192
  * size - 8192
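
Expressed as solrconfig.xml those settings are roughly (cache classes are
illustrative):

<documentCache class="solr.LRUCache" size="32768" initialSize="32768"/>
<filterCache class="solr.FastLRUCache" size="8192" initialSize="8192" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="8192" initialSize="8192" autowarmCount="128"/>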

After a replicated core has finished downloading (probably while it's
warming) we see requests which usually take around 100ms taking over 5s. GC
logs show concurrent mode failure.

I was wondering whether anyone can help with sizing the boxes required to
split this index down into shards for use with SolrCloud and roughly how
much memory we should be assigning to the JVM. Everything I've read
suggests that running with a 48GB heap is way too high but every attempt
I've made to reduce the cache sizes seems to wind up causing out-of-memory
problems. Even dropping all cache sizes by 50% and reducing the heap by 50%
caused problems.

I've already tried using SolrCloud with 10 shards (around 3.7M documents per
shard, each with one replica) and kept the cache sizes low:

Document cache:
  * initialSize - 1024
  * size - 1024

Filter cache:
  * autowarmCount - 128
  * initialSize - 512
  * size - 512

Query result cache:
  * autowarmCount - 32
  * initialSize - 128
  * size - 128
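
For completeness, the sort of Collections API call that produces that layout
(10 shards, replicationFactor of 2, up to four shards per node) would be
something like this, with an invented collection name:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=products&numShards=10&replicationFactor=2&maxShardsPerNode=4'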

Even when running on six machines in AWS with SSDs, 24GB heap (out of 60GB
memory) and four shards on two boxes and three on the rest I still see
concurrent mode failure. This looks like it's causing ZooKeeper to mark the
node as down and things begin to struggle.

Is concurrent mode failure just something that will inevitably happen or is
it avoidable by dropping the CMSInitiatingOccupancyFraction?
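
The sort of thing I have in mind is along these lines (values purely
illustrative, not something I've validated against this workload):

-XX:+UseConcMarkSweepGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=70
-XX:+ParallelRefProcEnabled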

If anyone has anything that might shove me in the right direction I'd be
very grateful. I'm wondering whether our set-up will just never work and
maybe we're expecting too much.

Many thanks,

Neil


Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-24 Thread Neil Prosser
That makes sense about all bets being off. I wanted to make sure that
people whose systems are behaving sensibly weren't going to have problems.

I think I need to tame the base amount of memory the field cache takes. We
currently do boosting on several fields during most queries. We boost by at
least 64 long fields (only one or two per query, but that many fields in
total) so with a maxDocs of around 10m on each shard we're talking about
nearly 5GB of heap just for the field cache. Then I still want to have a
filter cache which protects me from chewing too much CPU during requests.
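
For reference, the rough arithmetic behind that figure is 64 fields *
10,000,000 docs * 8 bytes per long = roughly 5GB per shard, before any other
field cache entries.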

I've seen a lot of concurrent mode failures flying by while staring at the
GC log so I think I need to get GC to kick in a bit sooner and try to limit
those (is it possible to eliminate them completely?). Performance is
definitely suffering while that's happening.

I had a go with G1 and no special options and things slowed down a bit. I'm
beginning to understand what to look for in the CMS GC logs so I'll stick
with that and see where I get to.

Thanks for the link. I'll give it a proper read when I get into work
tomorrow.

I have a feeling that five shards might not be the sweet spot for the spec
of the machines I'm running on. Our goal was to replace five 96GB physicals
with 48GB heaps doing master/slave replication for an index which is at
least 120GB in size. At the moment we're using ten VMs with 24GB of RAM,
8GB heaps and around 10GB of index. These machines are managing to get the
whole index into the Linux OS cache. Hopefully the 5GB minimum for field
cache and 8GB heap is what's causing this trouble right now.


On 24 July 2013 19:06, Shawn Heisey  wrote:

> On 7/24/2013 10:33 AM, Neil Prosser wrote:
> > The log for server09 starts with it throwing OutOfMemoryErrors. At this
> > point I externally have it listed as recovering. Unfortunately I haven't
> > got the GC logs for either box in that time period.
>
> There's a lot of messages in this thread, so I apologize if this has
> been dealt with already by previous email messages.
>
> All bets are off when you start throwing OOM errors.  The state of any
> java program pretty much becomes completely undefined when you run out
> of heap memory.
>
> It just so happens that I just finished updating a wiki page about
> reducing heap requirements for another message on the list.  GC pause
> problems have already been mentioned, so increasing the heap may not be
> a real option here.  Take a look at the following for ways to reduce
> your heap requirements:
>
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements
>
> Thanks,
> Shawn
>
>


Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-24 Thread Neil Prosser
Sorry, good point...

https://gist.github.com/neilprosser/d75a13d9e4b7caba51ab

I've included the log files for two servers hosting the same shard for the
same time period. The logging settings exclude anything below WARN
for org.apache.zookeeper, org.apache.solr.core.SolrCore
and org.apache.solr.update.processor.LogUpdateProcessor. That said, there's
still a lot of spam there.
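
In log4j.properties terms that's roughly:

log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.solr.core.SolrCore=WARN
log4j.logger.org.apache.solr.update.processor.LogUpdateProcessor=WARN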

The log for server09 starts with it throwing OutOfMemoryErrors. At this
point I externally have it listed as recovering. Unfortunately I haven't
got the GC logs for either box in that time period.

The key times that I know of are:

2013-07-24 07:14:08,560 - server04 registers its state as down.
2013-07-24 07:17:38,462 - server04 says it's the new leader (this ties in
with my external Graphite script observing that at 07:17 server04 was both
leader and down).
2013-07-24 07:31:21,667 - I get involved and server09 is restarted.
2013-07-24 07:31:42,408 - server04 updates its cloud state from ZooKeeper
and realises that it's the leader.
2013-07-24 07:31:42,449 - server04 registers its state as active.

I'm sorry there's so much there. I'm still getting used to what's important
for people. Both servers were running 4.3.1. I've since upgraded to 4.4.0.

If you need any more information or want me to do any filtering let me know.



On 24 July 2013 15:50, Timothy Potter  wrote:

> Log messages?
>
> On Wed, Jul 24, 2013 at 1:37 AM, Neil Prosser 
> wrote:
> > Great. Thanks for your suggestions. I'll go through them and see what I
> can
> > come up with to try and tame my GC pauses. I'll also make sure I upgrade
> to
> > 4.4 before I start. Then at least I know I've got all the latest changes.
> >
> > In the meantime, does anyone have any idea why I am able to get leaders
> who
> > are marked as down? I've just had the situation where of two nodes
> hosting
> > replicas of the same shard the leader was alive and marked as down and
> the
> > other replica was gone. I could perform searches directly on the two
> nodes
> > (with distrib=false) and once I'd restarted the node which was down the
> > leader sprang into life. I assume that since there was a change in
> > clusterstate.json it forced the leader to reconsider what it was up to.
> >
> > Does anyone know the hole my nodes are falling into? Is it likely to be
> > tied up in my GC woes?
> >
> >
> > On 23 July 2013 13:06, Otis Gospodnetic 
> wrote:
> >
> >> Hi,
> >>
> >> On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson <
> erickerick...@gmail.com>
> >> wrote:
> >> > Neil:
> >> >
> >> > Here's a must-read blog about why allocating more memory
> >> > to the JVM than Solr requires is a Bad Thing:
> >> >
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >> >
> >> > It turns out that you actually do yourself harm by allocating more
> >> > memory to the JVM than it really needs. Of course the problem is
> >> > figuring out how much it "really needs", which is pretty tricky.
> >> >
> >> > Your long GC pauses _might_ be ameliorated by allocating _less_
> >> > memory to the JVM, counterintuitive as that seems.
> >>
> >> or by using G1 :)
> >>
> >> See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/
> >>
> >> Otis
> >> --
> >> Solr & ElasticSearch Support -- http://sematext.com/
> >> Performance Monitoring -- http://sematext.com/spm
> >>
> >>
> >> > On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser  >
> >> wrote:
> >> >> I just have a little python script which I run with cron (luckily
> that's
> >> >> the granularity we have in Graphite). It reads the same JSON the
> admin
> >> UI
> >> >> displays and dumps numeric values into Graphite.
> >> >>
> >> >> I can open source it if you like. I just need to make sure I remove
> any
> >> >> hacks/shortcuts that I've taken because I'm working with our cluster!
> >> >>
> >> >>
> >> >> On 22 July 2013 19:26, Lance Norskog  wrote:
> >> >>
> >> >>> Are you feeding Graphite from Solr? If so, how?
> >> >>>
> >> >>>
> >> >>> On 07/19/2013 01:02 AM, Neil Prosser wrote:
> >> >>>
> >> >>>> That was overnight so I was unable to track exactly what happened
> (I'm
> >> >>>> going off our Graphite graphs here).
> >> >>>>
> >> >>>
> >> >>>
> >>
>


Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-24 Thread Neil Prosser
Great. Thanks for your suggestions. I'll go through them and see what I can
come up with to try and tame my GC pauses. I'll also make sure I upgrade to
4.4 before I start. Then at least I know I've got all the latest changes.

In the meantime, does anyone have any idea why I am able to get leaders who
are marked as down? I've just had the situation where of two nodes hosting
replicas of the same shard the leader was alive and marked as down and the
other replica was gone. I could perform searches directly on the two nodes
(with distrib=false) and once I'd restarted the node which was down the
leader sprang into life. I assume that since there was a change in
clusterstate.json it forced the leader to reconsider what it was up to.
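
For what it's worth, the direct queries were just of this form (collection
name made up):

curl 'http://host:8983/solr/collection1/select?q=*:*&distrib=false&rows=0'

and they returned happily even while the node was shown as down.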

Does anyone know the hole my nodes are falling into? Is it likely to be
tied up in my GC woes?


On 23 July 2013 13:06, Otis Gospodnetic  wrote:

> Hi,
>
> On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson 
> wrote:
> > Neil:
> >
> > Here's a must-read blog about why allocating more memory
> > to the JVM than Solr requires is a Bad Thing:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > It turns out that you actually do yourself harm by allocating more
> > memory to the JVM than it really needs. Of course the problem is
> > figuring out how much it "really needs", which is pretty tricky.
> >
> > Your long GC pauses _might_ be ameliorated by allocating _less_
> > memory to the JVM, counterintuitive as that seems.
>
> or by using G1 :)
>
> See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
>
> > On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser 
> wrote:
> >> I just have a little python script which I run with cron (luckily that's
> >> the granularity we have in Graphite). It reads the same JSON the admin
> UI
> >> displays and dumps numeric values into Graphite.
> >>
> >> I can open source it if you like. I just need to make sure I remove any
> >> hacks/shortcuts that I've taken because I'm working with our cluster!
> >>
> >>
> >> On 22 July 2013 19:26, Lance Norskog  wrote:
> >>
> >>> Are you feeding Graphite from Solr? If so, how?
> >>>
> >>>
> >>> On 07/19/2013 01:02 AM, Neil Prosser wrote:
> >>>
> >>>> That was overnight so I was unable to track exactly what happened (I'm
> >>>> going off our Graphite graphs here).
> >>>>
> >>>
> >>>
>


Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Neil Prosser
I just have a little python script which I run with cron (luckily that's
the granularity we have in Graphite). It reads the same JSON the admin UI
displays and dumps numeric values into Graphite.

I can open source it if you like. I just need to make sure I remove any
hacks/shortcuts that I've taken because I'm working with our cluster!
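
The guts of it are only a handful of lines; roughly this shape (endpoints and
metric names invented, error handling stripped):

import json
import socket
import time
import urllib2

# Hypothetical endpoints -- adjust host/core/Graphite details as appropriate.
MBEANS_URL = 'http://localhost:8983/solr/collection1/admin/mbeans?stats=true&wt=json'
CARBON_HOST, CARBON_PORT = 'graphite.example.com', 2003

# /admin/mbeans returns 'solr-mbeans' as a flat list alternating category
# name and bean dict, so pair them up into a normal dict.
beans = json.load(urllib2.urlopen(MBEANS_URL))['solr-mbeans']
categories = dict(zip(beans[::2], beans[1::2]))

now = int(time.time())
lines = []
for name, bean in categories.get('CACHE', {}).items():
    stats = bean.get('stats') or {}
    for key in ('size', 'hitratio', 'evictions'):
        if key in stats:
            lines.append('solr.%s.%s %s %d' % (name, key, stats[key], now))

# Carbon's plaintext protocol: one "path value timestamp" line per metric.
sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
sock.sendall('\n'.join(lines) + '\n')
sock.close()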


On 22 July 2013 19:26, Lance Norskog  wrote:

> Are you feeding Graphite from Solr? If so, how?
>
>
> On 07/19/2013 01:02 AM, Neil Prosser wrote:
>
>> That was overnight so I was unable to track exactly what happened (I'm
>> going off our Graphite graphs here).
>>
>
>


Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Neil Prosser
Sorry, I should also mention that these leader nodes which are marked as
down can actually still be queried locally with distrib=false with no
problems. Is it possible that they've somehow got themselves out-of-sync?


On 22 July 2013 13:37, Neil Prosser  wrote:

> No need to apologise. It's always good to have things like that reiterated
> in case I've misunderstood along the way.
>
> I have a feeling that it's related to garbage collection. I assume that if
> the JVM heads into a stop-the-world GC Solr can't let ZooKeeper know it's
> still alive and so gets marked as down. I've just taken a look at the GC
> logs and can see a couple of full collections which took longer than my ZK
> timeout of 15s. I'm still in the process of tuning the cache sizes and
> have probably got it wrong (I'm coming from a Solr instance which runs on a
> 48G heap with ~40m documents and bringing it into five shards with 8G
> heap). I thought I was being conservative with the cache sizes but I should
> probably drop them right down and start again. The entire index is cached
> by Linux so I should just need caches to help with things which eat CPU at
> request time.
>
> The indexing level is unusual because normally we wouldn't be indexing
> everything sequentially, just making delta updates to the index as things
> are changed in our MoR. However, it's handy to know how it reacts under the
> most extreme load we could give it.
>
> In the case that I set my hard commit time to 15-30 seconds with
> openSearcher set to false, how do I control when I actually do invalidate
> the caches and open a new searcher? Is this something that Solr can do
> automatically, or will I need some sort of coordinator process to perform a
> 'proper' commit from outside Solr?
>
> In our case the process of opening a new searcher is definitely a hefty
> operation. We have a large number of boosts and filters which are used for
> just about every query that is made against the index so we currently have
> them warmed which can take upwards of a minute on our giant core.
>
> Thanks for your help.
>
>
> On 22 July 2013 13:00, Erick Erickson  wrote:
>
>> Wow, you really shouldn't be having nodes go up and down so
>> frequently, that's a big red flag. That said, SolrCloud should be
>> pretty robust so this is something to pursue...
>>
>> But even a 5 minute hard commit can lead to a hefty transaction
>> log under load, you may want to reduce it substantially depending
>> on how fast you are sending docs to the index. I'm talking
>> 15-30 seconds here. It's critical that openSearcher be set to false
>> or you'll invalidate your caches that often. All a hard commit
>> with openSearcher set to false does is close off the current segment
>> and open a new one. It does NOT open/warm new searchers etc.
>>
>> The soft commits control visibility, so that's how you control
>> whether you can search the docs or not. Pardon me if I'm
>> repeating stuff you already know!
>>
>> As far as your nodes coming and going, I've seen some people have
>> good results by upping the ZooKeeper timeout limit. So I guess
>> my first question is whether the nodes are actually going out of service
>> or whether it's just a timeout issue
>>
>> Good luck!
>> Erick
>>
>> On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser 
>> wrote:
>> > Very true. I was impatient (I think less than three minutes impatient so
>> > hopefully 4.4 will save me from myself) but I didn't realise it was
>> doing
>> > something rather than just hanging. Next time I have to restart a node
>> I'll
>> > just leave and go get a cup of coffee or something.
>> >
>> > My configuration is set to auto hard-commit every 5 minutes. No auto
>> > soft-commit time is set.
>> >
>> > Over the course of the weekend, while left unattended the nodes have
>> been
>> > going up and down (I've got to solve the issue that is causing them to
>> come
>> > and go, but any suggestions on what is likely to be causing something
>> like
>> > that are welcome), at one point one of the nodes stopped taking updates.
>> > After indexing properly for a few hours with that one shard not
>> accepting
>> > updates, the replica of that shard which contains all the correct
>> documents
>> > must have replicated from the broken node and dropped documents. Is
>> there
>> > any protection against this in Solr or should I be focusing on getting
>> my
>> > nodes to be

Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Neil Prosser
No need to apologise. It's always good to have things like that reiterated
in case I've misunderstood along the way.

I have a feeling that it's related to garbage collection. I assume that if
the JVM heads into a stop-the-world GC Solr can't let ZooKeeper know it's
still alive and so gets marked as down. I've just taken a look at the GC
logs and can see a couple of full collections which took longer than my ZK
timeout of 15s). I'm still in the process of tuning the cache sizes and
have probably got it wrong (I'm coming from a Solr instance which runs on a
48G heap with ~40m documents and bringing it into five shards with 8G
heap). I thought I was being conservative with the cache sizes but I should
probably drop them right down and start again. The entire index is cached
by Linux so I should just need caches to help with things which eat CPU at
request time.

The indexing level is unusual because normally we wouldn't be indexing
everything sequentially, just making delta updates to the index as things
are changed in our MoR. However, it's handy to know how it reacts under the
most extreme load we could give it.

In the case that I set my hard commit time to 15-30 seconds with
openSearcher set to false, how do I control when I actually do invalidate
the caches and open a new searcher? Is this something that Solr can do
automatically, or will I need some sort of coordinator process to perform a
'proper' commit from outside Solr?
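
My reading so far is that the pieces look roughly like this in solrconfig.xml
(intervals picked purely for illustration), with autoSoftCommit being what
actually opens searchers:

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>

but I'd appreciate confirmation that I'm not missing a piece.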

In our case the process of opening a new searcher is definitely a hefty
operation. We have a large number of boosts and filters which are used for
just about every query that is made against the index so we currently have
them warmed which can take upwards of a minute on our giant core.

Thanks for your help.


On 22 July 2013 13:00, Erick Erickson  wrote:

> Wow, you really shouldn't be having nodes go up and down so
> frequently, that's a big red flag. That said, SolrCloud should be
> pretty robust so this is something to pursue...
>
> But even a 5 minute hard commit can lead to a hefty transaction
> log under load, you may want to reduce it substantially depending
> on how fast you are sending docs to the index. I'm talking
> 15-30 seconds here. It's critical that openSearcher be set to false
> or you'll invalidate your caches that often. All a hard commit
> with openSearcher set to false does is close off the current segment
> and open a new one. It does NOT open/warm new searchers etc.
>
> The soft commits control visibility, so that's how you control
> whether you can search the docs or not. Pardon me if I'm
> repeating stuff you already know!
>
> As far as your nodes coming and going, I've seen some people have
> good results by upping the ZooKeeper timeout limit. So I guess
> my first question is whether the nodes are actually going out of service
> or whether it's just a timeout issue
>
> Good luck!
> Erick
>
> On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser 
> wrote:
> > Very true. I was impatient (I think less than three minutes impatient so
> > hopefully 4.4 will save me from myself) but I didn't realise it was doing
> > something rather than just hanging. Next time I have to restart a node
> I'll
> > just leave and go get a cup of coffee or something.
> >
> > My configuration is set to auto hard-commit every 5 minutes. No auto
> > soft-commit time is set.
> >
> > Over the course of the weekend, while left unattended the nodes have been
> > going up and down (I've got to solve the issue that is causing them to
> come
> > and go, but any suggestions on what is likely to be causing something
> like
> > that are welcome), at one point one of the nodes stopped taking updates.
> > After indexing properly for a few hours with that one shard not accepting
> > updates, the replica of that shard which contains all the correct
> documents
> > must have replicated from the broken node and dropped documents. Is there
> > any protection against this in Solr or should I be focusing on getting my
> > nodes to be more reliable? I've now got a situation where four of my five
> > shards have leaders who are marked as down and followers who are up.
> >
> > I'm going to start grabbing information about the cluster state so I can
> > track which changes are happening and in what order. I can get hold of
> Solr
> > logs and garbage collection logs while these things are happening.
> >
> > Is this all just down to my nodes being unreliable?
> >
> >
> > On 21 July 2013 13:52, Erick Erickson  wrote:
> >
> >> Well, if I'm reading this right you had a node go out of circulation
> >> an

Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Neil Prosser
Very true. I was impatient (I think less than three minutes impatient so
hopefully 4.4 will save me from myself) but I didn't realise it was doing
something rather than just hanging. Next time I have to restart a node I'll
just leave and go get a cup of coffee or something.

My configuration is set to auto hard-commit every 5 minutes. No auto
soft-commit time is set.

Over the course of the weekend, while left unattended the nodes have been
going up and down (I've got to solve the issue that is causing them to come
and go, but any suggestions on what is likely to be causing something like
that are welcome), at one point one of the nodes stopped taking updates.
After indexing properly for a few hours with that one shard not accepting
updates, the replica of that shard which contains all the correct documents
must have replicated from the broken node and dropped documents. Is there
any protection against this in Solr or should I be focusing on getting my
nodes to be more reliable? I've now got a situation where four of my five
shards have leaders who are marked as down and followers who are up.

I'm going to start grabbing information about the cluster state so I can
track which changes are happening and in what order. I can get hold of Solr
logs and garbage collection logs while these things are happening.

Is this all just down to my nodes being unreliable?


On 21 July 2013 13:52, Erick Erickson  wrote:

> Well, if I'm reading this right you had a node go out of circulation
> and then bounced nodes until that node became the leader. So of course
> it wouldn't have the documents (how could it?). Basically you shot
> yourself in the foot.
>
> Underlying here is why it took the machine you were re-starting so
> long to come up that you got impatient and started killing nodes.
> There has been quite a bit done to make that process better, so what
> version of Solr are you using? 4.4 is being voted on right now, so
> you might want to consider upgrading.
>
> There was, for instance, a situation where it would take 3 minutes for
> machines to start up. How impatient were you?
>
> Also, what are your hard commit parameters? All of the documents
> you're indexing will be in the transaction log between hard commits,
> and when a node comes up the leader will replay everything in the tlog
> to the new node, which might be a source of why it took so long for
> the new node to come back up. At the very least the new node you were
> bringing back online will need to do a full index replication (old
> style) to get caught up.
>
> Best
> Erick
>
> On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser 
> wrote:
> > While indexing some documents to a SolrCloud cluster (10 machines, 5
> shards
> > and 2 replicas, so one replica on each machine) one of the replicas
> stopped
> > receiving documents, while the other replica of the shard continued to
> grow.
> >
> > That was overnight so I was unable to track exactly what happened (I'm
> > going off our Graphite graphs here). This morning when I was able to look
> > at the cluster both replicas of that shard were marked as down (with one
> > marked as leader). I attempted to restart the non-leader node but it
> took a
> > long time to restart so I killed it and restarted the old leader, which
> > also took a long time. I killed that one (I'm impatient) and left the
> > non-leader node to restart, not realising it was missing approximately
> 700k
> > documents that the old leader had. Eventually it restarted and became
> > leader. I restarted the old leader and it dropped the number of documents
> > it had to match the previous non-leader.
> >
> > Is this expected behaviour when a replica with fewer documents is started
> > before the other and elected leader? Should I have been paying more
> > attention to the number of documents on the server before restarting
> nodes?
> >
> > I am still in the process of tuning the caches and warming for these
> > servers but we are putting some load through the cluster so it is
> possible
> > that the nodes are having to work quite hard when a new version of the
> core
> > is made available. Is this likely to explain why I occasionally see
> > nodes dropping out? Unfortunately in restarting the nodes I lost the GC
> > logs to see whether that was likely to be the culprit. Is this the sort
> of
> > situation where you raise the ZooKeeper timeout a bit? Currently the
> > timeout for all nodes is 15 seconds.
> >
> > Are there any known issues which might explain what's happening? I'm just
> > getting started with SolrCloud after using standard master/slave
> > replication for an index which has got too big for one machine over the
> > last few months.
> >
> > Also, is there any particular information that would be helpful to help
> > with these issues if it should happen again?
>


Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-19 Thread Neil Prosser
While indexing some documents to a SolrCloud cluster (10 machines, 5 shards
and 2 replicas, so one replica on each machine) one of the replicas stopped
receiving documents, while the other replica of the shard continued to grow.

That was overnight so I was unable to track exactly what happened (I'm
going off our Graphite graphs here). This morning when I was able to look
at the cluster both replicas of that shard were marked as down (with one
marked as leader). I attempted to restart the non-leader node but it took a
long time to restart so I killed it and restarted the old leader, which
also took a long time. I killed that one (I'm impatient) and left the
non-leader node to restart, not realising it was missing approximately 700k
documents that the old leader had. Eventually it restarted and became
leader. I restarted the old leader and it dropped the number of documents
it had to match the previous non-leader.

Is this expected behaviour when a replica with fewer documents is started
before the other and elected leader? Should I have been paying more
attention to the number of documents on the server before restarting nodes?

I am still in the process of tuning the caches and warming for these
servers but we are putting some load through the cluster so it is possible
that the nodes are having to work quite hard when a new version of the core
is made available. Is this likely to explain why I occasionally see
nodes dropping out? Unfortunately in restarting the nodes I lost the GC
logs to see whether that was likely to be the culprit. Is this the sort of
situation where you raise the ZooKeeper timeout a bit? Currently the
timeout for all nodes is 15 seconds.
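
For reference, the timeout in question is zkClientTimeout. In the old-style
solr.xml it's an attribute on the <cores> element, along these lines (other
attributes omitted):

<cores adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}">

so it can also be raised with -DzkClientTimeout=30000 at startup without
editing the file.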

Are there any known issues which might explain what's happening? I'm just
getting started with SolrCloud after using standard master/slave
replication for an index which has got too big for one machine over the
last few months.

Also, is there any particular information that would be helpful to help
with these issues if it should happen again?