Kojo:

The Solr logs should give you a much better idea of what the triggering event was.
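As a rough sketch of checking those logs for a memory problem: the snippet below scans a log directory for lines mentioning `java.lang.OutOfMemoryError`, which is the exception class the JVM raises when it runs out of memory. The function name `find_oom_lines`, the `*.log*` file pattern, and whatever directory you pass in are assumptions for illustration; point it at wherever your Solr instance actually writes its logs.

```python
from pathlib import Path


def find_oom_lines(log_dir, needle="java.lang.OutOfMemoryError"):
    """Return (file name, line number, text) for each log line containing needle.

    log_dir is whatever directory your Solr instance logs to (an assumption;
    it depends on how Solr was installed and started).
    """
    hits = []
    # "*.log*" is a guess meant to also catch rotated files such as solr.log.1
    for path in sorted(Path(log_dir).glob("*.log*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file has odd bytes in it
        with path.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if needle in line:
                    hits.append((path.name, lineno, line.rstrip()))
    return hits
```

Calling `find_oom_lines("/path/to/solr/logs")` (a placeholder path) returns an empty list when nothing matches, which would point away from memory as the cause.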
Just increasing the heap doesn’t guarantee much. Again, the Solr logs will report the OOM exception if it’s memory-related. You haven’t told us what your physical RAM is or how much you’re allocating to heap; those would be helpful.

As far as Solr not answering, It Depends (tm). How are you querying Solr? If it’s just using an HTTP request to the node that died, there’s no communication possible; the HTTP endpoint is down. If you’re using SolrJ or a load balancer in front, then it should indeed get to the live Solr and you should get a reply.

I’ll add that from what you report, this system seems massively over-sharded. I generally start my testing with the assumption that I can fit 50,000,000 documents per shard on a decent-sized box. So unless this configuration is for massive planned growth, the number of shards you have is far in excess of what you need. This isn’t the root cause of your problem, but it doesn’t help either.

Best,
Erick

> On Aug 12, 2019, at 7:47 AM, Kojo <rbsnk...@gmail.com> wrote:
>
> Hi,
> I am using Solr cloud on this configuration:
>
> 2 boxes (one Solr in each box)
> 4 instances per box
>
> At this moment I have an active collection with about 300.000 docs. The
> other collections are not being queried. The active collection is
> configured:
> - shards: 16
> - replication factor: 2
>
> These two Solrs (Solr1 and Solr2) use ZooKeeper (one box, one instance. No
> ZooKeeper cluster)
>
> My application points to Solr1, and everything works fine, until suddenly one
> instance of this Solr1 dies. This instance is on port 8983, the "main"
> instance. I thought it could be related to memory usage, but we increased
> RAM and JVM memory and it still dies.
> Solr1, the one which dies, is the destination where I point my web
> application.
>
> Here I have two questions that I hope you can help me with:
>
> 1. Which log can I look at to debug this issue?
> 2. After this instance dies, the Solr cloud does not answer my web
> application. Is this correct? I thought that the replicas should answer if
> one shard, instance or one box goes down.
>
> Regards,
> Koji
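
To put rough numbers on the over-sharding remark, here is a small back-of-the-envelope sketch. It uses only figures from this thread: about 300,000 documents (reading "300.000" as three hundred thousand, which is an assumption about the number format), 16 shards, replication factor 2, and the 50,000,000 documents per shard starting assumption from the reply. That last figure is a starting point for testing, not a hard limit, and the helper name `shards_needed` is made up for illustration.

```python
import math

# Starting assumption from the reply above; a testing baseline, not a Solr limit
DOCS_PER_SHARD_START = 50_000_000


def shards_needed(total_docs, docs_per_shard=DOCS_PER_SHARD_START):
    """Shards required if each shard holds up to docs_per_shard documents."""
    return max(1, math.ceil(total_docs / docs_per_shard))


total_docs = 300_000        # "about 300.000 docs", read as 300,000 (assumption)
shards = 16                 # from the original message
replication_factor = 2      # from the original message

docs_per_shard = total_docs / shards        # documents each shard holds today
total_cores = shards * replication_factor   # replicas hosted across the cluster

print(f"docs per shard now: {docs_per_shard:,.0f}")
print(f"cores to host:      {total_cores}")
print(f"shards needed at the starting assumption: {shards_needed(total_docs)}")
```

Under these assumptions each shard holds fewer than twenty thousand documents, and a single shard would sit well inside the starting estimate. Real capacity depends on document size, query load, and hardware, so treat this as a sanity check only.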