Hello, Dominique.
What did it log? Which exception?
Have you had a chance to review a heap dump? What consumed the whole heap?

On Sun, May 17, 2020 at 11:05 AM Dominique Bejean <dominique.bej...@eolya.fr>
wrote:

> Hi,
>
> I have a six-node SolrCloud whose six nodes all suddenly failed with OOM
> at the same time.
> This can happen even when the SolrCloud is not under heavy load and there
> is no indexing.
>
> I do not see any reason for this to happen. Here is a description of the
> issue. Thank you for your suggestions and advice.
>
>
> One or two hours before the nodes stop with OOM, we see this scenario on
> all six nodes during the same five-minute time frame:
> * a little more young GC activity: from one per second (duration < 0.05 s)
> to one every two or three seconds (duration < 0.15 s)
> * full GCs start occurring every 5 s with 0 bytes reclaimed
> * young GCs start reclaiming fewer bytes
> * long full GCs start reclaiming bytes, but fewer and fewer each time
> * then no more young GC
> Here are the GC graphs: https://www.eolya.fr/solr_issue_gc.png
>
>
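
Since JMX is enabled on these nodes, the GC pattern described above could also be tracked as numbers rather than only from the graphs. Below is a minimal sketch using the standard `GarbageCollectorMXBean` API. The class `GcSampler` and its methods `snapshot` and `gcOverhead` are hypothetical names chosen for illustration, and the 5-second interval is an arbitrary assumption. It samples the JVM it runs in, so watching a Solr node would additionally need a remote JMX connection, which is not shown here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration only; not part of Solr.
final class GcSampler {

    /** Returns {total collection count, total collection time in ms}, summed over all collectors. */
    static long[] snapshot() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, timeMs};
    }

    /** Approximate fraction of the interval spent in GC between two snapshots. */
    static double gcOverhead(long[] before, long[] after, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        return (after[1] - before[1]) / (double) intervalMs;
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMs = 5000; // assumed sampling interval
        long[] before = snapshot();
        Thread.sleep(intervalMs);
        long[] after = snapshot();
        System.out.printf("collections=%d gcOverhead=%.3f%n",
                after[0] - before[0], gcOverhead(before, after, intervalMs));
    }
}
```

A value that climbs toward 1.0 while the collection count keeps rising would correspond to the "full GC every 5 s with 0 bytes reclaimed" phase, and might serve as an early trigger for taking thread and heap dumps.
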
> Just before the problem occurs:
> * there are no more requests per second than usual
> * no update/commit/merge
> * CPU usage and load are low
> * disk I/O is low
> After the problem starts, requests take longer and longer, but there is
> still no increase in CPU usage or disk I/O
>
>
> During the last incident, we dumped the threads on one node just before the
> OOM but, unfortunately, more than one hour after the problem started.
> 85% of threads (more than 3000) are BLOCKED and related to log4j.
> Solr is either trying to log a slow query or trying to log a problem in the
> request handler:
> at org.apache.solr.common.SolrException.log(SolrException.java:148)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:204)
>
> This high count of BLOCKED threads is more a consequence than a cause. We
> will dump threads every minute until the next incident.
>
>
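
When threads are dumped every minute, comparing the dumps by hand gets tedious, so a per-state count for each dump may help show when the BLOCKED share starts to grow. Below is a minimal sketch, assuming the dumps are saved as jstack-style text files in which each Java thread has a `java.lang.Thread.State: ...` line. `ThreadStateCounter` and `countStates` are hypothetical names, not part of Solr or the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class ThreadStateCounter {

    // jstack prints one line such as
    // "   java.lang.Thread.State: BLOCKED (on object monitor)" per Java thread.
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Counts how many threads in the dump text are in each state. */
    static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = STATE.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ThreadStateCounter <thread-dump-file>");
            return;
        }
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(countStates(dump));
    }
}
```

Running it over each one-minute dump in turn gives a simple time series of BLOCKED versus RUNNABLE threads, which could indicate whether the log4j contention comes before or after the full GC phase.
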
> About the Solr environment:
> * Solr 6.6
> * Oracle Java 1.8.0_112 (25.112-b15)
>
> * 1 collection with 10 million small documents
> * 3 shards x 2 replicas
> * 3.5 million docs per core
> * 90 GB index size per core
>
> * Server with 6 processors and 90 GB of RAM
> * Swappiness set to 1, nearly no swap used
> * 4 GB heap, roughly 25 to 60% used before young GC, and one full GC (3
> seconds) every 15 to 30 minutes when all is fine.
>
> * Default JVM settings with CMS GC
> * JMX enabled
> * Average requests per second at peak on one core: 170, but during the last
> incident the average requests per second was only 30!
> * Average time per request: < 30 ms
>
> About updates:
> * Very few adds/updates in general
> * Some deleteByQuery operations (nearly 2000 per day), but none just before
> the problem occurs
> * autocommit maxTime:15000ms
>
> About queries:
> * Queries are standard queries or suggesters
> * Queries generate facets, but there are no fields with a very high number
> of unique values
> * No grouping
> * Heavy use of function queries for relevance computation
>
>
> Thank you.
>
> Dominique
>


-- 
Sincerely yours
Mikhail Khludnev
