Just a little update on my concurrency issue. The problem I was having was that under heavy load, individual Solr instances would be slow to respond, eventually leading to flapping cluster membership.

I tweaked a bunch of settings in Linux, Jetty, Solr, and within my application, but in the end none of these changes prevented the stability problems. Instead, I modified my HAProxy config to limit the maximum number of simultaneous connections on a per-server basis. By capping the number of simultaneous queries each Solr instance handles at 30, I've effectively prevented long-running queries from stacking up and getting continually slower. HAProxy now queues the pending requests and lets them through whenever there's available capacity. As a result, Solr behaves normally under intense load, and even though queries perform more slowly during these times, it never turns into runaway slowness.
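For anyone who finds this thread later, here's a minimal sketch of the kind of HAProxy change I mean. The backend name, addresses, and queue timeout are placeholders; 30 just happens to be the cap that suits my hardware:

    backend solr
        # requests beyond a server's maxconn wait in HAProxy's queue
        # instead of piling up inside Solr
        timeout queue 30s
        server solr1 10.0.0.11:8983 check maxconn 30
        server solr2 10.0.0.12:8983 check maxconn 30
        server solr3 10.0.0.13:8983 check maxconn 30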
My best guess as to why I ran into this issue is that my query volume was large relative to the on-disk index size: the index fits comfortably in the page cache, so Solr spends almost no time waiting on disk IO. That, perhaps, leaves the door open for query-driven CPU utilization to cause more fundamental problems in Solr's performance... Or maybe I missed something stupid at the OS level. Sigh.

Many thanks for all the help!

-Dave

On Wed, Dec 28, 2016 at 7:11 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> You'll see some lines with three different times in them: "user", "sys",
> and "real". The one that really counts is "real"; that's the time that
> the process was stopped while GC went on. That's the "stop" in
> "stop-the-world" (STW) GC.
>
> What you're looking for is two things:
>
> 1> outrageously long times
> and/or
> 2> these happening one right after the other.
>
> For <2>, I've seen situations where you go into a STW pause, collect
> a tiny bit of memory (say a few meg), and try to continue, only to go
> right back into another. It might take, say, 2 seconds of "real" time
> to do the GC, then go back into another 2-second cycle 500ms later.
> That kind of thing.
>
> GCViewer can help you make sense of the GC logs:
> https://sourceforge.net/projects/gcviewer/
>
> Unfortunately, GC tuning is "more art than science" ;(
>
> Best,
> Erick
>
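A note for the archives, since I had asked below what to grep for: in the
"Total time for which application threads were stopped" lines (see my log
excerpt below), the pause duration is the 11th whitespace-separated field.
Assuming that exact -XX:+PrintGCApplicationStoppedTime output format,
something along these lines surfaces the outrageously long stops Erick
describes:

    # print every application stop longer than 1 second (threshold to taste)
    grep "Total time for which application threads were stopped" solr_gc.log | awk '$11 > 1.0'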
> On Wed, Dec 28, 2016 at 10:57 AM, Dave Seltzer <dselt...@tveyes.com> wrote:
> > Hi Erick,
> >
> > You're probably right about it not being a threading issue. In general
> > it seems that CPU contention could indeed be the issue.
> >
> > Most of the settings we're using in Solr came right out of the box,
> > including Jetty's configuration, which specifies:
> >
> >     solr.jetty.threads.min: 10
> >     solr.jetty.threads.max: 10000
> >     solr.jetty.threads.idle.timeout: 5000
> >     solr.jetty.threads.stop.timeout: 60000
> >
> > The only interesting thing we're doing is disabling the query cache.
> > This is because individual hash-matching queries tend to be unique and
> > therefore don't benefit significantly from query caching.
> >
> > On the GC side, I'm not really sure what to look for. Here's an example
> > message from /solr/logs/solr_gc.log:
> >
> > 2016-12-28T13:48:56.872-0500: 9453.890: Total time for which application threads were stopped: 0.8394383 seconds, Stopping threads took: 0.0004007 seconds
> > {Heap before GC invocations=8169 (full 124):
> >  par new generation   total 3495296K, used 3495296K [0x00000003c0000000, 0x00000004c0000000, 0x00000004c0000000)
> >   eden space 2796288K, 100% used [0x00000003c0000000, 0x000000046aac0000, 0x000000046aac0000)
> >   from space  699008K, 100% used [0x0000000495560000, 0x00000004c0000000, 0x00000004c0000000)
> >   to   space  699008K,   0% used [0x000000046aac0000, 0x000000046aac0000, 0x0000000495560000)
> >  concurrent mark-sweep generation total 12582912K, used 12111153K [0x00000004c0000000, 0x00000007c0000000, 0x00000007c0000000)
> >  Metaspace       used 33470K, capacity 33998K, committed 34360K, reserved 1079296K
> >   class space    used 3716K, capacity 3888K, committed 3960K, reserved 1048576K
> > 2016-12-28T13:48:57.415-0500: 9454.434: [GC (Allocation Failure) 2016-12-28T13:48:57.415-0500: 9454.434: [ParNew
> > Desired survivor size 644205768 bytes, new threshold 3 (max 8)
> > - age   1:  284566200 bytes,  284566200 total
> > - age   2:  197448288 bytes,  482014488 total
> > - age   3:  168306328 bytes,  650320816 total
> > - age   4:   48423744 bytes,  698744560 total
> > - age   5:   17038920 bytes,  715783480 total
> > : 3495296K->699008K(3495296K), 1.2399730 secs] 15606449K->13188910K(16078208K), 1.2403791 secs] [Times: user=4.60 sys=0.00, real=1.24 secs]
> >
> > Is there something I should be grepping for in this enormous file?
> >
> > Many thanks!
> >
> > -Dave
> >
> > On Wed, Dec 28, 2016 at 12:44 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> Threads are usually a container parameter, I think. True, Solr wants
> >> lots of threads. My return volley would be: how busy is your CPU when
> >> this happens? If it's pegged, more threads probably aren't really
> >> going to help. And if it's a GC issue, more threads would probably
> >> hurt.
> >>
> >> Best,
> >> Erick
> >>
> >> On Wed, Dec 28, 2016 at 9:14 AM, Dave Seltzer <dselt...@tveyes.com> wrote:
> >> > Hi Erick,
> >> >
> >> > I'll dig in on these timeout settings and see how changes affect
> >> > behavior.
> >> >
> >> > One interesting aspect is that we're not indexing any content at
> >> > the moment. The rate of ingress is something like 10 to 20
> >> > documents per day.
> >> >
> >> > So my guess is that ZK is simply deciding that these servers are
> >> > dead based on the fact that responses are so very sluggish.
> >> >
> >> > You've mentioned lots of timeouts, but are there any settings which
> >> > control the number of available threads? Or is this something which
> >> > is largely handled automagically?
> >> >
> >> > Many thanks!
> >> >
> >> > -Dave
> >> >
> >> > On Wed, Dec 28, 2016 at 11:56 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >> Dave:
> >> >>
> >> >> There are at least 4 timeouts (not even including ZK) that can be
> >> >> relevant, defined in solr.xml:
> >> >>
> >> >>     socketTimeout
> >> >>     connTimeout
> >> >>     distribUpdateConnTimeout
> >> >>     distribUpdateSoTimeout
> >> >>
> >> >> Plus the ZK timeout:
> >> >>
> >> >>     zkClientTimeout
> >> >>
> >> >> Plus the ZK configurations.
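All five of those settings live in solr.xml, for anyone following along.
Here's a sketch of where each one sits, using the placeholder defaults
from a stock Solr 6-era solr.xml (illustrative values, not tuning advice):

    <solr>
      <solrcloud>
        <!-- shipped defaults, in milliseconds -->
        <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
        <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
        <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
      </solrcloud>
      <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
        <int name="socketTimeout">${socketTimeout:600000}</int>
        <int name="connTimeout">${connTimeout:60000}</int>
      </shardHandlerFactory>
    </solr>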
> >> >> So it would help narrow down what's going on if we knew why the
> >> >> nodes dropped out. There are indeed a lot of messages dumped, but
> >> >> somewhere in the logs there should be a root cause.
> >> >>
> >> >> You might see Leader Initiated Recovery (LIR), which can indicate
> >> >> that an update operation from the leader took too long; the
> >> >> timeouts above can be adjusted in this case.
> >> >>
> >> >> You might see evidence that ZK couldn't get a response from Solr
> >> >> in "too long" and decided it was gone.
> >> >>
> >> >> You might see...
> >> >>
> >> >> One thing I'd look at very closely is GC processing. One of the
> >> >> culprits for this behavior I've seen is a very long GC
> >> >> stop-the-world pause leading to ZK thinking the node is dead and
> >> >> tripping this chain. Depending on the timeouts, "very long" might
> >> >> be a few seconds.
> >> >>
> >> >> Not entirely helpful, but until you pinpoint why the node goes
> >> >> into recovery it's throwing darts at the wall. GC and log messages
> >> >> might give some insight into the root cause.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Wed, Dec 28, 2016 at 8:26 AM, Dave Seltzer <dselt...@tveyes.com> wrote:
> >> >> > Hello Everyone,
> >> >> >
> >> >> > I'm working on a Solr Cloud cluster which is used in a
> >> >> > hash-matching application.
> >> >> >
> >> >> > For performance reasons we've opted to batch-execute
> >> >> > hash-matching queries. This means that a single query will
> >> >> > contain many nested queries. As you might expect, these queries
> >> >> > take a while to execute. (On the order of 5 to 10 seconds.)
> >> >> >
> >> >> > I've noticed that Solr will act erratically when we send too
> >> >> > many long-running queries. Specifically, heavily-loaded servers
> >> >> > will repeatedly fall out of the cluster and then recover. My
> >> >> > theory is that there's some limit on the number of concurrent
> >> >> > connections and that client queries are starving the
> >> >> > ZooKeeper-related requests... but I'm not sure. I've increased
> >> >> > zkClientTimeout to combat this.
> >> >> >
> >> >> > My question is: what configuration settings should I be looking
> >> >> > at in order to make sure I'm maximizing Solr's ability to handle
> >> >> > concurrent requests?
> >> >> >
> >> >> > Many thanks!
> >> >> >
> >> >> > -Dave