Stack trace:

http://timvaillancourt.com.s3.amazonaws.com/tmp/solrcloud.nodeC.2013-07-25-16.jstack.gz

Cheers!

Tim


On 25 July 2013 16:44, Tim Vaillancourt <t...@elementspace.com> wrote:

> Hey guys,
>
> I am reaching out to the Solr list with a very vague issue: under high
> load against a SolrCloud 4.3.1 cluster of 3 instances, 3 shards, and 2
> replicas per shard (2 cores per instance), I eventually see failure
> messages related to the transaction logs, and shortly after these stack
> traces appear the cluster starts to fall apart.
>
> To explain my setup:
> - SolrCloud 4.3.1.
> - Jetty 9.x.
> - Oracle/Sun JDK 1.7.25 w/CMS.
> - RHEL 6.x 64-bit.
> - 3 instances, 1 per server.
> - 3 shards.
> - 2 replicas per shard.
>
> The transaction log error I receive after about 10-30 minutes of load
> testing is:
>
> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
> Failure to open existing log file (non fatal)
> /opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.0000000000000000078:org.apache.solr.common.SolrException:
> java.io.EOFException
>         at
> org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:182)
>         at org.apache.solr.update.UpdateLog.init(UpdateLog.java:233)
>         at
> org.apache.solr.update.UpdateHandler.initLog(UpdateHandler.java:83)
>         at
> org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:138)
>         at
> org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:125)
>         at
> org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:95)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>         at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:525)
>         at
> org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:596)
>         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:805)
>         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
>         at
> org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:894)
>         at
> org.apache.solr.core.CoreContainer.create(CoreContainer.java:982)
>         at
> org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:597)
>         at
> org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:592)
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
> Caused by: java.io.EOFException
>         at
> org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
>         at
> org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
>         at
> org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
>         at
> org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
>         ... 25 more
> "
>
> Eventually, after a few of these stack traces, the cluster starts to lose
> shards and replicas begin to fail. Jetty threads then pile up in a hung
> state until the JVM hits OutOfMemory creating new native threads, because
> the process reaches its maximum-threads ulimit.
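>
> (To make "OutOfMemory on native threads" concrete: once the process can't
> create another OS thread, for example because a low process ulimit has
> been reached, thread creation fails with java.lang.OutOfMemoryError:
> unable to create new native thread. A throwaway sketch that reproduces
> just that error, not something to run on a real box:)
>
>     public class NativeThreadLimitDemo {
>         public static void main(String[] args) {
>             long count = 0;
>             try {
>                 // Keep starting threads that never exit until the OS
>                 // refuses to create another one.
>                 while (true) {
>                     new Thread(new Runnable() {
>                         public void run() {
>                             try {
>                                 Thread.sleep(Long.MAX_VALUE);
>                             } catch (InterruptedException ignored) {
>                             }
>                         }
>                     }).start();
>                     count++;
>                 }
>             } catch (OutOfMemoryError e) {
>                 System.out.println("Failed after " + count + " threads: " + e);
>             }
>         }
>     }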
>
> I know this is quite a vague issue, so I'm not expecting a silver-bullet
> answer, but does anyone have suggestions on where to look next? Does this
> sound Solr-related at all, or more like a system problem? Has anyone seen
> this issue before, or have a hypothesis on how to find out more?
>
> I will reply shortly with a thread dump, taken from 1 locked-up node.
>
> Thanks for any suggestions!
>
> Tim
>
