The symptom is that the Java clients querying Solr see response times in the
tens of seconds (not milliseconds).
In Tomcat's gc.log (Tomcat is where Solr runs), we also see very bad GC
pauses: threads are paused for approximately 0.5 seconds out of every second.
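
As a rough sketch of how a "seconds paused per second" figure can be derived from a GC log: the snippet below assumes the JVM was started with -XX:+PrintGCApplicationStoppedTime and uptime timestamps, so that gc.log contains "Total time for which application threads were stopped" lines. Whether our gc.log has exactly this shape is an assumption, and the sample lines are made up for illustration, not copied from our servers.

```python
import re

# Matches HotSpot safepoint lines such as:
#   "101.000: Total time for which application threads were stopped: 0.5000000 seconds"
STOPPED = re.compile(
    r"(?P<uptime>\d+\.\d+): Total time for which application threads "
    r"were stopped: (?P<paused>\d+\.\d+) seconds"
)


def pause_fraction(lines):
    """Fraction of wall-clock time spent paused between the first and last matching line."""
    samples = []
    for line in lines:
        m = STOPPED.search(line)
        if m:
            samples.append((float(m.group("uptime")), float(m.group("paused"))))
    if len(samples) < 2:
        return 0.0
    span = samples[-1][0] - samples[0][0]
    # Each line is logged after its pause ends, so the first line's pause
    # lies before the measured span and is left out of the sum.
    paused = sum(p for _, p in samples[1:])
    return paused / span if span > 0 else 0.0


# Hypothetical sample lines (illustrative values only).
sample = [
    "100.000: Total time for which application threads were stopped: 0.5000000 seconds",
    "101.000: Total time for which application threads were stopped: 0.5000000 seconds",
    "102.000: Total time for which application threads were stopped: 0.5000000 seconds",
]
print(pause_fraction(sample))
```

A result near 0.5 would correspond to the "0.5 seconds per second" described above.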

Some numbers for the Solr Cloud:

*Overall infrastructure:*
- Only one collection
- 16 VMs used
- 8 shards (1 leader and 1 replica per shard - each core on separate VM)

*Overview from one core:*
- Num Docs:193,623,388
- Max Doc:230,577,696
- Heap Memory Usage:231,217,880
- Deleted Docs:36,954,308
- Version:2,357,757
- Segment Count:37
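
A quick consistency check on the core numbers above, as a small sketch: Max Doc counts live plus deleted documents, so Max Doc minus Num Docs should equal Deleted Docs. The variable names are only illustrative.

```python
# Figures copied from the core overview above.
num_docs = 193_623_388
max_doc = 230_577_696
deleted_docs = 36_954_308

# Max Doc includes deleted documents that have not been merged away yet.
computed_deleted = max_doc - num_docs
deleted_fraction = computed_deleted / max_doc

print(computed_deleted == deleted_docs)
print(f"{deleted_fraction:.1%} of Max Doc is deleted documents")
```

The figures agree, and roughly 16% of the documents in this core's index are deletions awaiting segment merges.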

*Stats from QueryHandler/select*
- requests:78,557
- errors:358
- timeouts:0
- totalTime:1,639,975.27
- avgRequestsPerSecond:2.62
- 5minRateReqsPerSecond:1.39
- 15minRateReqsPerSecond:1.64
- avgTimePerRequest:20.87
- medianRequestTime:0.70
- 75thPcRequestTime:1.11
- 95thPcRequestTime:191.76

*Stats from QueryHandler/update*
- requests:33,555
- errors:0
- timeouts:0
- totalTime:227,870.58
- avgRequestsPerSecond:1.12
- 5minRateReqsPerSecond:1.16
- 15minRateReqsPerSecond:1.23
- avgTimePerRequest:6.79
- medianRequestTime:3.16
- 75thPcRequestTime:5.27
- 95thPcRequestTime:9.33
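
To make the shape of these numbers easier to see, here is a small sketch that recomputes the average from totalTime and requests and compares it with the median. It assumes the times are in milliseconds, which is how these handler stats are normally reported; the function name is hypothetical.

```python
def summarize(requests, errors, total_time_ms, median_ms):
    """Derive average time, error rate, and mean/median ratio from handler stats."""
    avg_ms = total_time_ms / requests
    return {
        "avg_ms": avg_ms,
        "error_rate": errors / requests,
        "mean_over_median": avg_ms / median_ms,
    }


# Figures copied from the /select and /update stats above.
select = summarize(78_557, 358, 1_639_975.27, 0.70)
update = summarize(33_555, 0, 227_870.58, 3.16)

for name, stats in (("select", select), ("update", update)):
    print(
        f"{name}: avg={stats['avg_ms']:.2f} ms, "
        f"errors={stats['error_rate']:.2%}, "
        f"mean/median={stats['mean_over_median']:.1f}"
    )
```

The recomputed averages agree with the reported avgTimePerRequest values to within rounding. For /select the mean is roughly 30 times the median, so a small number of very slow requests dominates the average, while /update has a mean only about twice its median.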

And yet the Solr clients are reporting timeouts and very long read times.

In addition, every server is logging many exceptions.
For example, between 8:06:55 PM and 8:21:36 PM, the exceptions are:

1) Request says it is coming from leader, but we are the leader:
update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2

2) org.apache.solr.common.SolrException: Request says it is coming from
leader, but we are the leader

3) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

4) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

5) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

6) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

7) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request. Zombie server list:
[HOSTA_ca_1_1456429897]

8) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request. Zombie server list:
[HOSTA_ca_1_1456429897]

9) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

10) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

11) org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

12) null:org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: Tried one server for read
operation and it timed out, so failing fast

Why, then, are we seeing so many timeouts, and why are the response times on
the client so high?

Thanks
SG



On Sat, Dec 3, 2016 at 4:19 PM, <billnb...@gmail.com> wrote:

> What tool is that? I would like to run those stats on my Solr instance.
>
> Bill Bell
> Sent from mobile
>
>
> > On Dec 2, 2016, at 4:49 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >
> >> On 12/2/2016 12:01 PM, S G wrote:
> >> This post shows some stats on Solr which indicate that there might be a
> >> memory leak in there.
> >>
> >> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
> >>
> >> Can someone please help to debug this?
> >> It might be a very good step in making Solr stable if we can fix this.
> >
> > +1 to what Walter said.
> >
> > I replied earlier on the stackoverflow question.
> >
> > FYI -- your 95th percentile request time of about 16 milliseconds is NOT
> > something that I would characterize as "very high."  I would *love* to
> > have statistics that good.
> >
> > Even your 99th percentile request time is not much more than a full
> > second.  If a search takes a couple of seconds, most users will not
> > really care, and some might not even notice.  It's when a large
> > percentage of queries start taking several seconds that complaints start
> > coming in.  On your system, 99 percent of your queries are completing in
> > 1.3 seconds or less, and 95 percent of them are less than 17
> > milliseconds.  That sounds quite good to me.
> >
> > In my experience, the time it takes for the browser to receive the
> > search result page and render it is a significant part of the total time
> > to see results, and often dwarfs the time spent getting info from Solr.
> >
> > Here's some numbers from Solr in my organization:
> >
> > requests:               4102054
> > errors:                 364894
> > timeouts:               49
> > totalTime:              799446287.45041
> > avgRequestsPerSecond:   1.2375565828793849
> > 5minRateReqsPerSecond:  0.8444329508327961
> > 15minRateReqsPerSecond: 0.8631197328073346
> > avgTimePerRequest:      194.88926460997587
> > medianRequestTime:      20.8566605
> > 75thPcRequestTime:      85.51328849999999
> > 95thPcRequestTime:      2202.277466549999
> > 99thPcRequestTime:      5280.375381280002
> > 999thPcRequestTime:     6866.020122961001
> >
> > The numbers above come from a distributed index that contains 167
> > million documents and takes up about 200GB of disk space across two
> > machines.
> >
> > requests:               192683
> > errors:                 124
> > timeouts:               0
> > totalTime:              199380421.985073
> > avgRequestsPerSecond:   0.042222722771354554
> > 5minRateReqsPerSecond:  0.00800545427600684
> > 15minRateReqsPerSecond: 0.017521222412364163
> > avgTimePerRequest:      1034.7587591280653
> > medianRequestTime:      541.591858
> > 75thPcRequestTime:      1683.83246125
> > 95thPcRequestTime:      5644.542019949997
> > 99thPcRequestTime:      9445.592394760004
> > 999thPcRequestTime:     14602.166640771007
> >
> > These numbers are from an index with about 394 million documents, taking
> > up nearly 500GB of disk space.  This index is also distributed on
> > multiple machines.
> >
> > Are you experiencing any problems other than what you perceive as slow
> > queries?  I asked some other questions on stackoverflow.  In particular,
> > I'd like to know the total memory on the server, the total number of
> > documents (maxDoc and numDoc) you're handling with this server, as well
> > as the total index size.  What do your queries look like?  What version
> > and vendor of Java are you using?  Can you share your config/schema?
> >
> > A memory leak is very unlikely, unless your Java or your operating
> > system is broken.  I can't say for sure that it's not happening, but
> > it's just not something we see around here.
> >
> > Here's what I have collected on performance issues in Solr.  This page
> > does mostly concern itself with memory, though it touches briefly on
> > other topics:
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Thanks,
> > Shawn
> >
>
