bq: Anyway, adding a separate node with 4th zookeeper instance might help, right?
No. The formula for a quorum is (num_zookeeper_nodes)/2 + 1, using integer division. With three nodes, two must be up; adding a fourth node requires that _three_ of them be up. Either way, only one node can be unreachable, so four nodes tolerate the same number of failures as three. An even number of ZK instances actually makes failure _more_ likely: there are more nodes that can go down, but no more that you can afford to lose.

bq: ...since they share same nodes with Solr instances

As separate processes? Or embedded? If the latter, the cure is obvious (run ZooKeeper as its own process). If the former, consider running the ZK instances on other nodes.

Best,
Erick

On Mon, Aug 11, 2014 at 8:28 AM, Alexander S. (JIRA) <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/SOLR-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092884#comment-14092884 ]
>
> Alexander S. commented on SOLR-3274:
> ------------------------------------
>
> Hi, thanks for the response.
>
> bq. Well you never know
> I've checked the nodes' status: the 3rd node was online all the time and there was no load on it.
>
> bq. In a 3-node ZK-cluster you need at least 2 healthy ZK-nodes connected with each other for the cluster to be operational.
> That should be the problem, since the 2 other ZK instances might (theoretically) be unavailable because of heavy load (they share nodes with Solr instances). Both nodes have 16 CPU cores, 48G of memory and RAID 10 (SSD), so I thought it would be hard to get performance issues there.
> Anyway, adding a separate node with a 4th ZooKeeper instance might help, right?
>
> > ZooKeeper related SolrCloud problems
> > ------------------------------------
> >
> >                 Key: SOLR-3274
> >                 URL: https://issues.apache.org/jira/browse/SOLR-3274
> >             Project: Solr
> >          Issue Type: Bug
> >          Components: SolrCloud
> >    Affects Versions: 4.0-ALPHA
> >         Environment: Any
> >            Reporter: Per Steffensen
> >
> > Same setup as in SOLR-3273.
> > Well, if I have to tell the entire truth, we have 7 Solr servers running 28 slices of the same collection (collA). Every slice has one replica (two shards per slice all in all: leader + replica), which gives 56 cores all in all (8 shards on each Solr instance). But anyway...
> >
> > Besides the problem reported in SOLR-3273, the system seems to run fine under high load for several hours, but eventually errors like the ones shown below start to occur. I might be wrong, but they all seem to indicate some kind of instability in the collaboration between Solr and ZooKeeper. I have to say that I haven't been there to check ZooKeeper "at the moment where those exceptions occur", but basically I don't believe the exceptions occur because ZooKeeper is not running stably - at least when I go and check ZooKeeper through other "channels" (e.g. my eclipse ZK plugin) it is always accepting my connection and generally seems to be doing fine.
> >
> > Exception 1) Often the first error we see in solr.log is something like this
> > {code}
> > Mar 22, 2012 5:06:43 AM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
> >         at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:678)
> >         at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:250)
> >         at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
> >         at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:80)
> >         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
> >         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
> >         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:407)
> >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
> >         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> >         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> >         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> >         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> >         at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> >         at org.mortbay.jetty.Server.handle(Server.java:326)
> >         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> >         at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
> >         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
> >         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
> >         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> >         at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
> >         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> > {code}
> > I believe this error basically occurs because SolrZkClient.isConnected reports false, which means that its internal "keeper.getState" does not return ZooKeeper.States.CONNECTED. I'm pretty sure that it has been CONNECTED for a long time, since this error starts occurring after several hours of processing without this problem showing. But why is it suddenly not connected anymore?!
> >
> > Exception 2) We also see errors like the following, and if I'm not mistaken, they start occurring shortly after "Exception 1)" (above) shows for the first time
> > {code}
> > Mar 22, 2012 5:07:26 AM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
> >         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149)
> >         at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:123)
> >         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >         at java.lang.Thread.run(Thread.java:662)
> > {code}
> > Please note that the exception says "no servers hosting shard: <blank>". Looking at the code, a "shard"-string was actually supposed to be written at <blank>.
> > Basically this means that HttpShardHandler.submit was called with an empty "shard"-string parameter. But who does this? CoreAdminHandler.handleDistribUrlAction or SearchHandler.handleRequestBody or SyncStrategy or PeerSync or... I don't know, and maybe it is not that relevant, because I guess they all get the "shard"-string from ZooKeeper. Again something pointing in the direction of unstable collaboration between Solr and ZooKeeper.
> >
> > Exception 3) We also see exceptions like this
> > {code}
> > Mar 25, 2012 3:05:38 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> > WARNING: ZooKeeper watch triggered, but Solr cannot talk to ZK
> > Mar 25, 2012 3:05:38 PM org.apache.solr.cloud.LeaderElector$1 process
> > WARNING:
> > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/collA/leader_elect/slice26/election
> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
> >         at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:266)
> >         at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:263)
> >         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
> >         at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:263)
> >         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:92)
> >         at org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
> >         at org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:121)
> >         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
> >         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> > {code}
> > Maybe this will be usable for some bug-fixing or for making the code more stable.
> > I know 4.0 is not stable/released yet, and that we therefore should expect this kind of error at the moment. So this is not negative criticism - just a report of issues observed when using SolrCloud features under high load for several days. Any feedback is more than welcome.
> >
> > Regards, Per Steffensen
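
For anyone who wants to check the quorum arithmetic from the reply at the top of this message, here is a minimal Python sketch. The function names `quorum_size` and `tolerated_failures` are made up for illustration and are not part of ZooKeeper or Solr; the only thing taken from the thread is the formula (num_zookeeper_nodes)/2 + 1, read with integer division.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest majority of an ensemble: floor(n / 2) + 1.

    Hypothetical helper for illustration; not a ZooKeeper API.
    """
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of nodes that can be unreachable while a quorum still exists."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Print the quorum and failure tolerance for ensembles of 1 to 7 nodes.
    for n in range(1, 8):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"can lose {tolerated_failures(n)}")
```

Under these assumptions, 3 and 4 nodes both tolerate exactly one unreachable node, and it takes 5 nodes to tolerate two.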
