Hi all,

We are trying to index a large number of documents into SolrCloud and keep
seeing errors from the client like: org.apache.solr.common.SolrException:
Service Unavailable, always with a similar stack trace:
request: http://wp-np2-c0:8983/solr/uniprot/update?wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:320)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$22(ExecutorUtil.java:229)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$57/936653983.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
The settings are: 5 nodes in the cluster, each with 16 GB of memory. The
collection is defined with 5 shards and a replication factor of 2. The total
number of documents is about 90 million, and each document is quite large as
well. We also run a 5-node ZooKeeper ensemble, with one instance on each node.
On the Solr side, we can see errors like:
solr.log.3-Error from server at http://wp-np2-c4.ebi.ac.uk:8983/solr/uniprot_shard5_replica1: Server Error
solr.log.3-request: http://wp-np2-c4.ebi.ac.uk:8983/solr/uniprot_shard5_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fwp-np2-c0.ebi.ac.uk%3A8983%2Fsolr%2Funiprot_shard2_replica1%2F&wt=javabin&version=2
solr.log.3-Remote error message: Async exception during distributed update: Connect to wp-np2-c2.ebi.ac.uk:8983 timed out
solr.log.3-    at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:948)
solr.log.3-    at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1679)
solr.log.3-    at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:182)
--
solr.log.3-    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
solr.log.3-    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
solr.log.3-    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
solr.log.3-    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
solr.log.3-    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
solr.log.3-    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
solr.log.3-    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
solr.log.3-    at java.lang.Thread.run(Thread.java:745)
The strange bit is that this exception doesn't seem to be caught by the
try/catch block in our main thread, and the cluster appears to be in good
health (all nodes up) after the job is done; we are just missing lots of
documents!
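For what it's worth, the stack trace shows the update being sent from a pool thread (ConcurrentUpdateSolrClient$Runner inside a ThreadPoolExecutor), which would explain why our try/catch never fires. Below is a minimal JDK-only sketch of that pattern; the class name and the "Service Unavailable" message here are ours, not anything Solr emits:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class AsyncErrorSketch {

    // Submit a failing task with execute(): the exception is thrown on the
    // pool thread, so a try/catch around the submitting code never fires
    // (the JVM just prints the uncaught exception to stderr).
    static boolean mainThreadCaught() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        boolean caught = false;
        try {
            pool.execute(() -> { throw new RuntimeException("Service Unavailable"); });
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (RuntimeException e) {
            caught = true; // never reached: the failure happened on another thread
        }
        return caught;
    }

    // The failure only becomes visible through an explicit hand-off such as
    // a Future; a callback hook would be the other common mechanism.
    static String surfacedViaFuture() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> f = pool.submit(() -> { throw new RuntimeException("Service Unavailable"); });
        String msg = null;
        try {
            f.get();
        } catch (ExecutionException e) {
            msg = e.getCause().getMessage();
        }
        pool.shutdown();
        return msg;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("main thread caught it: " + mainThreadCaught());
        System.out.println("surfaced via Future: " + surfacedViaFuture());
    }
}
```

If this is indeed what is happening, then overriding ConcurrentUpdateSolrClient's handleError(Throwable) hook (which, as far as we can tell, is where it reports such asynchronous failures instead of throwing to the caller) might at least let us count and retry the dropped batches.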
Any suggestions on where we should look to resolve this problem?
Best Regards,
Wudong