On 7/25/2013 6:53 PM, Tim Vaillancourt wrote:
> Thanks for the reply Shawn, I can always count on you :).
>
> To answer the JVM question: we are using 10GB heaps and have over 100GB
> of OS cache free. Young gen is about 50% of the heap, the rest CMS. Our
> max number of processes for the JVM user is 10k, which is where Solr dies
> when it blows up with 'cannot create native thread'.
>
> I also want to say this is system related, but I am seeing this occur on
> all 3 servers, which are brand-new Dell R720s. I'm not saying a system
> problem is impossible, but I don't see much to suggest that, and it would
> need to be one hell of a coincidence.
Nice hardware. I have some R720xd servers for another project unrelated to
Solr, and I love them, so I know a little about Dell servers. If you
haven't done so already, I would install the OpenManage repo and get the
firmware fully updated - BIOS, RAID, and LAN in particular. Instructions
that are pretty easy to follow:

http://linux.dell.com/repo/hardware/latest/

For process/file limits, I have the following in /etc/security/limits.conf
on systems that aren't using SolrCloud:

ncindex hard nproc 6144
ncindex soft nproc 4096
ncindex hard nofile 65535
ncindex soft nofile 49151

> To add more confusion to the mix, we actually run a 2nd SolrCloud cluster
> on the same Solr, Jetty, and JVM versions that does not exhibit this
> issue, although it uses a completely different schema, different servers,
> and different access patterns - though it is also high-TPS. That is some
> evidence that the current software stack is OK; or maybe this only occurs
> under an extreme load that the 2nd cluster does not see, or only with a
> certain schema.

This is a big reason why I think you should make sure you're fully up to
date on your firmware, as the hardware seems to be one strong difference.
As much as I love Dell server hardware, firmware issues are relatively
common, especially on early revisions of the latest generation, which
includes the R720.

> Lastly, to add a bit more detail to my original description, so far I
> have tried:
>
> - Entirely rebuilding my cluster from scratch, reinstalling all deps and
> configs and reindexing the data (in case I screwed up somewhere). The
> EXACT same issue occurs under load about 20-45 minutes in.
> - Moving to Java 1.7.0_21 from _25 due to some known bugs. Same issue
> occurs after some load.
> - Restarting SolrCloud / forcing rebuilds of cores. Same issue occurs
> after some load.

The only other thing I can think of is increasing your zkClientTimeout to
30 seconds or so and trying Solr 4.4, so that you have the fixes from
SOLR-4899 and SOLR-4805. That's very definitely a shot in the dark.
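For reference, in the legacy solr.xml that ships with 4.x, zkClientTimeout
is an attribute on the <cores> element. A sketch of the relevant fragment
(attribute names from the stock example; your file will have more
attributes, and the change suggested here is only the timeout value):

```
<!-- Solr 4.x legacy solr.xml fragment; 30000 ms = 30 seconds.
     The ${zkClientTimeout:30000} form means it can still be overridden
     on the command line with -DzkClientTimeout=... -->
<solr persistent="true">
  <cores adminPath="/admin/cores"
         hostPort="${jetty.port:8983}"
         zkClientTimeout="${zkClientTimeout:30000}">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
```

A longer timeout just gives Solr more headroom before ZooKeeper declares a
node dead during long GC pauses or load spikes; it won't fix the thread
exhaustion itself.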
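One thing worth double-checking: limits.conf settings only apply to new
login sessions, so the JVM may not be running with the values you think it
has. A quick sanity check, assuming a stock Linux shell (run it as whatever
user owns the Solr process):

```shell
# Show the effective limits for the current shell; run this as the Solr
# user (e.g. su - solr -s /bin/bash) to see what a new JVM would inherit.
# "max user processes" (nproc) is the resource that caps native threads,
# which is exactly what's exhausted when you see the
# 'cannot create native thread' OutOfMemoryError.
nproc_limit="$(ulimit -u)"   # max user processes
nofile_limit="$(ulimit -n)"  # max open files
echo "nproc:  $nproc_limit"
echo "nofile: $nofile_limit"

# For an already-running Solr, read the live values from /proc instead,
# since a restart inside an old session won't pick up limits.conf changes:
#   cat /proc/<solr-pid>/limits
```

If /proc/&lt;pid&gt;/limits still shows the old numbers, the process was started
before the limits change (or from a non-login context like an init script
that bypasses pam_limits), and a full re-login/restart is needed.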
Thanks,
Shawn