Stack trace: http://timvaillancourt.com.s3.amazonaws.com/tmp/solrcloud.nodeC.2013-07-25-16.jstack.gz
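
A quick way to get an overview of a dump like this is to tally threads per state. The following is a minimal Python sketch, assuming the standard jstack text format in which each thread entry carries a `java.lang.Thread.State:` line. The function names and the local file path are hypothetical and not part of the original report.

```python
import gzip
import re
import sys
from collections import Counter

# Matches the state line jstack prints under each thread header, e.g.
# "   java.lang.Thread.State: WAITING (parking)"
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def count_thread_states(lines):
    """Count threads per java.lang.Thread.State in jstack output lines."""
    counts = Counter()
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def main(path):
    # Read the dump directly whether or not it is still gzipped.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        counts = count_thread_states(handle)
    for state, total in counts.most_common():
        print(f"{state:15} {total}")


if __name__ == "__main__":
    # Hypothetical usage: python thread_states.py solrcloud.nodeC.jstack.gz
    if len(sys.argv) == 2:
        main(sys.argv[1])
```

A large number of threads in a single state (for example BLOCKED or WAITING) points at where to read the dump more closely. The tally alone does not identify a cause.
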
Cheers!

Tim

On 25 July 2013 16:44, Tim Vaillancourt <t...@elementspace.com> wrote:

> Hey guys,
>
> I am reaching out to the Solr list with a very vague issue: under high
> load against a SolrCloud 4.3.1 cluster of 3 instances, 3 shards, 2 replicas
> (2 cores per instance), I eventually see failure messages related to
> transaction logs, and shortly after these stacktraces occur the cluster
> starts to fall apart.
>
> To explain my setup:
> - SolrCloud 4.3.1.
> - Jetty 9.x.
> - Oracle/Sun JDK 1.7.25 w/CMS.
> - RHEL 6.x 64-bit.
> - 3 instances, 1 per server.
> - 3 shards.
> - 2 replicas per shard.
>
> The transaction log error I receive after about 10-30 minutes of load
> testing is:
>
> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
> Failure to open existing log file (non fatal)
> /opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.0000000000000000078:org.apache.solr.common.SolrException:
> java.io.EOFException
>         at org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:182)
>         at org.apache.solr.update.UpdateLog.init(UpdateLog.java:233)
>         at org.apache.solr.update.UpdateHandler.initLog(UpdateHandler.java:83)
>         at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:138)
>         at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:125)
>         at org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:95)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>         at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:525)
>         at org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:596)
>         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:805)
>         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
>         at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:894)
>         at org.apache.solr.core.CoreContainer.create(CoreContainer.java:982)
>         at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:597)
>         at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:592)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
> Caused by: java.io.EOFException
>         at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
>         at org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
>         at org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
>         at org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
>         ... 25 more
> "
>
> Eventually after a few of these stack traces, the cluster starts to lose
> shards and replicas fail. Jetty then creates hung threads until hitting
> OutOfMemory on native threads due to the maximum process ulimit.
>
> I know this is quite a vague issue, so I'm not expecting a silver-bullet
> answer, but I was wondering if anyone has suggestions on where to look
> next? Does this sound Solr-related at all, or possibly system? Has anyone
> seen this issue before, or has any hypothesis how to find out more?
>
> I will reply shortly with a thread dump, taken from 1 locked-up node.
>
> Thanks for any suggestions!
>
> Tim
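
The EOFException in the quoted trace is thrown while `TransactionLog.readHeader` reads the start of the file, which is consistent with a tlog file that is empty or truncated. The following is a minimal Python sketch, under that assumption, that lists the `tlog.*` files in a core's tlog directory and flags the zero-byte ones. The helper names are hypothetical, and a non-empty file can still be truncated, so the sizes are only a first hint.

```python
import os
import sys


def list_tlogs(tlog_dir):
    """Return (name, size_in_bytes) for each tlog.* file, sorted by name."""
    entries = []
    for name in sorted(os.listdir(tlog_dir)):
        path = os.path.join(tlog_dir, name)
        if name.startswith("tlog.") and os.path.isfile(path):
            entries.append((name, os.path.getsize(path)))
    return entries


def empty_tlogs(entries):
    """Return the names of tlog files whose size is zero bytes."""
    return [name for name, size in entries if size == 0]


if __name__ == "__main__":
    # Hypothetical usage: python check_tlogs.py <core>/data/tlog
    if len(sys.argv) == 2:
        entries = list_tlogs(sys.argv[1])
        for name, size in entries:
            print(f"{size:12d}  {name}")
        for name in empty_tlogs(entries):
            print(f"empty: {name}")
```

Running this against the tlog directory named in the error (and the same directory on the other replicas) shows whether the file Solr failed to open is empty.
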