On 10/1/2015 1:26 PM, Rallavagu wrote:
> Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.
>
> See following errors in ZK and Solr and they are connected.
>
> When I see the following error in Zookeeper,
>
> unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Packet len11823809 is out of range!
This is usually caused by the overseer queue (stored in zookeeper) becoming extraordinarily huge, because it is being flooded with work entries far faster than the overseer can process them. This causes the znode where the queue is stored to become larger than the maximum size for a znode, which defaults to about 1MB. In this case (reading your log message that says len11823809), something in zookeeper has grown to about 11MB, so the zookeeper client cannot read it.

I think the zookeeper server code must be handling the addition of children to the queue znode through a code path that doesn't pay attention to the maximum buffer size, and simply appends the data. I'm unfamiliar with how the ZK database works, so I'm guessing here.

If I'm right about where the problem is, there are two workarounds for your immediate issue:

1) Delete all the entries in your overseer queue using a zookeeper client that lets you edit the database directly. If you haven't changed the cloud structure and all your servers are working, this should be safe.

2) Set the jute.maxbuffer system property on the startup commandline for all ZK servers and all ZK clients (Solr instances) to a size that's large enough to accommodate the huge znode.

In order to do the deletion mentioned in option 1 above, you might need to increase jute.maxbuffer on the servers and on the client you use for the deletion.

These are just workarounds. Whatever caused the huge queue in the first place must be addressed; it is frequently a performance issue.

If you go to the following link, you will see that jute.maxbuffer is considered an unsafe option:

http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#Unsafe+Options

In Jira issue SOLR-7191, I wrote the following in one of my comments:

"The giant queue I encountered was about 850000 entries, and resulted in a packet length of a little over 14 megabytes.
If I divide 850000 by 14, I know that I can have about 60000 overseer queue entries in one znode before jute.maxbuffer needs to be increased."

https://issues.apache.org/jira/browse/SOLR-7191?focusedCommentId=14347834

Thanks,
Shawn
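For workaround 1, a possible sketch uses the zkCli.sh shipped with ZooKeeper; on the 3.4.x line the recursive delete command is `rmr` (later versions call it `deleteall`). The host:port and the /overseer/queue path are the usual SolrCloud defaults, but verify both in your own deployment before deleting anything:

```shell
# If the queue znode is already larger than 1MB, zkCli itself needs a
# bigger read buffer, hence the CLIENT_JVMFLAGS setting (read by zkEnv.sh).
# 16MB is an assumed example value, not a recommendation.
export CLIENT_JVMFLAGS="-Djute.maxbuffer=16777216"

# Connect to one member of the ensemble (adjust host:port to your setup).
bin/zkCli.sh -server zk1.example.com:2181

# Inside the zkCli shell:
#   ls /overseer/queue       <- inspect first; confirm what you are deleting
#   rmr /overseer/queue      <- recursively delete the queue and its children
```

Only do this while the cloud structure is stable, per the caveat above.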
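Workaround 2 might look like the fragments below. The property has to be set on both sides (servers and Solr clients), and how the flag is injected varies by installation, so treat these as illustrative config fragments, not exact commands:

```shell
# ZooKeeper side: conf/java.env is sourced by zkServer.sh at startup.
# 16MB is an arbitrary example; pick a value larger than the oversized znode.
SERVER_JVMFLAGS="-Djute.maxbuffer=16777216"

# Solr side (a Solr 4.x jetty start.jar example), same value on the
# commandline; the zkHost list here is a placeholder for your ensemble:
java -Djute.maxbuffer=16777216 \
     -DzkHost=zk1:2181,zk2:2181,zk3:2181 \
     -jar start.jar
```

Remember the linked documentation classifies jute.maxbuffer as an unsafe option, so remove the override once the underlying problem is fixed.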
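The arithmetic in the quoted SOLR-7191 comment can be sanity-checked directly. This is back-of-the-envelope math only, assuming packet length grows linearly with the number of queue entries (each child znode name contributing a similar number of bytes):

```python
# Figures from the SOLR-7191 comment quoted above.
observed_entries = 850_000
observed_packet_bytes = 14 * 1024 * 1024   # "a little over 14 megabytes"
default_maxbuffer = 1 * 1024 * 1024        # jute.maxbuffer default, ~1MB

bytes_per_entry = observed_packet_bytes / observed_entries   # roughly 17 bytes
entries_per_megabyte = default_maxbuffer / bytes_per_entry   # about 60000

# The error in this thread reported len11823809 (~11.3MB), which by the
# same ratio corresponds to several hundred thousand queue entries.
implied_entries = 11_823_809 / bytes_per_entry

print(round(entries_per_megabyte))
print(round(implied_entries))
```

So the len11823809 packet is consistent with a queue roughly eleven times the size the default buffer can hold.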