On 11/3/2017 10:15 PM, Rick Dig wrote:
We are trying to run SolrCloud 6.6 in a production setting.
Here's our config and the issue:
1) 3 nodes, 1 shard, replication factor 3.
2) All nodes have 16 GB RAM and 4 cores.
3) Our production load is about 2000 requests per minute.
4) The index is fairly small: around 400 MB with 300k documents.
5) autoCommit is currently set to 5 minutes (even though ideally we would
like a smaller interval).
6) The JVM runs with 8 GB Xms and Xmx and the CMS garbage collector.
7) All of this runs perfectly OK when indexing isn't happening. As soon as
we start "NRT" indexing, one of the follower nodes goes down within 10 to 20
minutes. From that point on the nodes never recover unless we stop
indexing. The master is usually the last one to fall.
8) There are maybe 5 to 7 processes indexing at the same time, with document
batch sizes of 500.
9) ramBufferSizeMB is 100, maxWarmingSearchers is 5.
10) No CPU or OOM issues that we can see.
11) CPU load does go fairly high, 15 to 20 at times.

My two cents to add to what you've already seen:

With 300K documents and 400 MB of index size, an 8 GB heap seems excessive, even with complex queries. What evidence do you have that you need a heap that size? Are you just following a best-practice recommendation you saw somewhere to give half your memory to Java?
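
As an illustration only (assuming you start Solr with the included bin/solr script and set options in solr.in.sh -- the 2g value here is just an example, not a recommendation tuned to your system), a smaller heap would be set like this:

  SOLR_HEAP="2g"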

This is a *tiny* index by both document count and size. Each document cannot be very big.

Your GC log doesn't show any issues that concern me. There are a few slow GCs, but that's probably to be expected while you're indexing, especially with an 8 GB heap.

What exactly do you mean by "one of the follower nodes goes down"? When this happens, are there error messages at the time of the event? What symptoms are there pertaining to that specific node?

A query load of 2000 per minute is about 33 per second. Are those queries steady throughout the minute, or do they come in bursts? 33 qps is high, but not insane, and with such a tiny index it is probably well within Solr's capabilities.

There should be no reason to *ever* increase maxWarmingSearchers. If you see the warning about it, the fix is to reduce your commit frequency, not to increase the value; increasing it can lead to memory and performance problems. The fact that this value is even being discussed, and that it has been changed on your setup, makes me think there may be more commits happening than the every-five-minute autoCommit.
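
If I remember right, the setting lives in the <query> section of solrconfig.xml, and the default in 6.6 is 2. If it has been raised on your install, I would put it back to the default (shown here only as an illustration):

  <maxWarmingSearchers>2</maxWarmingSearchers>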

For automatic commits, I have a recommended starting point that anyone can use and then adjust if necessary: autoCommit with a maxTime of 60000 and openSearcher set to false, and autoSoftCommit with a maxTime of 120000. Neither one should have maxDocs configured.
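
Those settings go inside the <updateHandler> section of solrconfig.xml; as a sketch of that starting point (the same values as above, not tuned to your system):

  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <maxTime>120000</maxTime>
  </autoSoftCommit>

With openSearcher set to false, the hard commits flush to disk but do not make changes visible; the soft commits every two minutes are what control document visibility.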

It should take far less than 20 seconds to index a 500-document batch, especially when the documents are small enough that 300K of them produce a 400 MB index. With no real information to go on, there are only a few problems I can imagine right now that could cause such slow indexing:

1) The analysis chains in your schema are exceptionally heavy and take a long time to run.
2) There is a performance issue happening that we have not yet figured out.
3) Your indexing request includes a commit, and the commit is happening very slowly.

Here is a log entry from one of my indexes showing 1000 documents being added in 777 milliseconds. The index this is happening on is about 40 GB in size, with about 30 million documents. I have redacted part of the uniqueKey values in this log to hide the sources of our data:

2017-11-04 09:30:14.325 INFO (qtp1394336709-42397) [ x:spark6live] o.a.s.u.p.LogUpdateProcessorFactory [spark6live] webapp=/solr path=/update params={wt=javabin&version=2}{add=[REDACTEDsix557224 (1583127266377859072), REDACTEDsix557228 (1583127266381004800), REDACTEDtwo979483 (1583127266381004801), REDACTEDtwo979488 (1583127266382053376), REDACTEDtwo979490 (1583127266383101952), REDACTEDsix557260 (1583127266383101953), REDACTEDsix557242 (1583127266384150528), REDACTEDsix557258 (1583127266385199104), REDACTEDsix557247 (1583127266385199105), REDACTEDsix557276 (1583127266394636288), ... (1000 adds)]} 0 777

The rate I'm getting here, 1000 docs in 777 milliseconds, is one I consider pretty slow, especially because my indexing is single-threaded. But it works for us. 500 documents taking 20 seconds is slower than I've EVER seen, except in situations where there was a serious problem. On a healthy system with multiple threads indexing, Solr should be able to index several thousand documents every second.

Is the indexing program running on the same machine as Solr, or on another machine? For best results it should be on a different machine, accessing Solr via HTTP, so that whatever load the indexing program creates does not take CPU, memory, and I/O resources away from Solr.

What OS is Solr running on? If more information turns out to be needed, knowing the OS will tell us precisely how to gather it.

Overall, based on the information currently available, you should not be having the problems you are having, so there must be something in your setup, beyond what we have already seen, that is not configured correctly. It could be directly Solr-related, or something else indirectly causing problems. I do not yet know exactly what additional information we might need in order to help.

Can you share an entire solr.log file that covers enough time so that there is both indexing and querying happening? If it also covers that node going down, that would be even better. You'll probably need to use a file-sharing website to share the log -- I'm surprised your GC log made it to the list.

Thanks,
Shawn
