Thanks for the reply Shawn, I can always count on you :). To answer the JVM questions: we are using 10GB heaps and have over 100GB of OS cache free; the young generation is about 50% of the heap, and we use CMS for the old generation. Our max number of processes for the JVM user is 10k, which is the limit Solr hits when it blows up with 'cannot create native thread'.
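For reference, here is a minimal Java sketch (hypothetical, not our actual tooling; the class name and sampling interval are made up) of one way to watch how close the JVM gets to that 10k limit before 'cannot create native thread' starts appearing:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: periodically print live, peak and daemon JVM thread
// counts so thread growth can be compared against the per-user process limit.
public class ThreadWatch {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        while (true) {
            System.out.printf("live=%d peak=%d daemon=%d%n",
                    threads.getThreadCount(),
                    threads.getPeakThreadCount(),
                    threads.getDaemonThreadCount());
            Thread.sleep(5000); // sample every 5 seconds
        }
    }
}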
I also want to say this is system related, but I am seeing this occur on all 3 servers, which are brand-new Dell R720s. I'm not saying that's impossible, but I don't see much to suggest it, and it would need to be one hell of a coincidence.

To add more confusion to the mix, we actually run a 2nd SolrCloud cluster on the same Solr, Jetty and JVM versions that does not exhibit this issue, although it uses a completely different schema, servers and access patterns (it also runs at high TPS). That is some evidence that the current software stack is OK; alternatively, this may only occur under an extreme load the 2nd cluster does not see, or only with a certain schema.

Lastly, to add a bit more detail to my original description, so far I have tried:

- Entirely rebuilding my cluster from scratch: reinstalling all deps and configs and reindexing the data (in case I screwed up somewhere). The EXACT same issue occurs under load about 20-45 minutes in.
- Moving to Java 1.7.0_21 from _25 due to some known bugs. Same issue occurs after some load.
- Restarting SolrCloud / forcing rebuilds of cores. Same issue occurs after some load.

Cheers,

Tim

On 25 July 2013 17:13, Shawn Heisey <s...@elyograg.org> wrote:
> On 7/25/2013 5:44 PM, Tim Vaillancourt wrote:
>> The transaction log error I receive after about 10-30 minutes of load
>> testing is:
>>
>> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
>> Failure to open existing log file (non fatal)
>> /opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.0000000000000000078:org.apache.solr.common.SolrException:
>> java.io.EOFException
>
> <snip>
>
>> Caused by: java.io.EOFException
>>   at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
>>   at org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
>>   at org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
>>   at org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
>>   ... 25 more
>> "
>
> This looks to me like a system problem. RHEL should be pretty solid; I
> use CentOS without any trouble. My initial guesses are a corrupt
> filesystem, failing hardware, or possibly a kernel problem with your
> specific hardware.
>
> I'm running Jetty 8, which is the version that the example uses. Could
> Jetty 9 be a problem here? I couldn't really say, though my initial guess
> is that it's not a problem.
>
> I'm running Oracle Java 1.7.0_13. Normally later releases are better, but
> Java bugs do exist and do get introduced in later releases. Because you're
> on the absolute latest, I'm guessing that you had the problem with an
> earlier release and upgraded to see if it went away. If that's what
> happened, it is less likely that it's Java.
>
> My first instinct would be to do a 'yum distro-sync' followed by 'touch
> /forcefsck' and reboot with console access to the server, so that you can
> deal with any fsck problems. Perhaps you've already tried that. I'm aware
> that this could be very, very hard to get pushed through strict change
> management procedures.
>
> I did some searching. SOLR-4519 is a different problem, but it looks like
> it has a similar underlying exception, with no resolution. It was filed
> when Solr 4.1.0 was current.
>
> Could there be a resource problem - heap too small, not enough OS disk
> cache, etc?
>
> Thanks,
> Shawn
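P.S. For anyone hitting the same trace: a minimal Java sketch (hypothetical, not Solr's actual code; the file name is made up) of why a truncated or partially written log header could end in the same java.io.EOFException quoted above:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical illustration: reading a fixed-size header from a log file that
// was cut off mid-write fails with the same exception type that
// TransactionLog.readHeader reports, because too few bytes remain for readInt().
public class TruncatedHeaderDemo {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in =
                     new DataInputStream(new FileInputStream("tlog.partial"))) {
            int header = in.readInt(); // EOFException if fewer than 4 bytes remain
            System.out.println("header=" + header);
        } catch (EOFException e) {
            System.out.println("Truncated log file: " + e);
        }
    }
}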