What led you to trying that? I'm not connecting the dots in my head - the exception and the solution.
- Mark

On Feb 3, 2013, at 2:48 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:

Hi,

I think the issue was not in zk client timeout, but POST request size.
When I increased the value for Request.maxFormContentSize in jetty.xml
I don't see this issue any more.

Regards.
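For context: Jetty caps the size of POST bodies it will parse as form
content (the default in this Jetty version is around 200 KB), and Solr 4.x
shipped running inside Jetty. The change Marcin describes would look
something like the following inside the existing <Configure id="Server">
element of etc/jetty.xml - the 10 MB limit shown here is illustrative, not
the value he actually used:

    <!-- inside <Configure id="Server" class="org.eclipse.jetty.server.Server"> -->
    <Call name="setAttribute">
      <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
      <Arg type="java.lang.Integer">10485760</Arg>
    </Call>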
On 3 February 2013 01:56, Mark Miller <markrmil...@gmail.com> wrote:

Do you see anything about session expiration in the logs? That is the
likely culprit for something like this. You may need to raise the timeout:
http://wiki.apache.org/solr/SolrCloud#FAQ

If you see no session timeouts, I don't have a guess yet.

- Mark

On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:

I'm experiencing the same problem in Solr 4.1 during bulk loading. After
50 minutes of indexing, the following error starts to occur:

INFO: [core] webapp=/solr path=/update params={} {} 0 4
Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
        at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
        at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:365)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
        at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=50699

Then the leader tries to sync with the replica, and after it finishes I
can continue loading. None of the SolrCloud nodes was restarted during
that time. I don't remember such behaviour in Solr 4.0. Could it be
related to the number of fields indexed during loading? I have a
collection with about 2400 fields. I can't reproduce the same issue for
other collections with far fewer fields per record.

Regards.

On 11 December 2012 19:50, Sudhakar Maddineni <maddineni...@gmail.com> wrote:

Just an update on this issue: we tried increasing the zookeeper client
timeout to 30000ms in solr.xml (I think the default is 15000ms), and
haven't seen any issues in our tests since.

<cores ......... zkClientTimeout="30000" >

Thanks, Sudhakar.
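Filling out the elided attributes with placeholders, a legacy-format
solr.xml carrying that setting might look roughly like this - host, port,
and core names here are illustrative, not taken from the cluster above:

    <solr persistent="true">
      <!-- zkClientTimeout is the ZooKeeper session timeout Solr requests,
           in milliseconds -->
      <cores adminPath="/admin/cores" defaultCoreName="core1"
             host="${host:}" hostPort="8080" zkClientTimeout="30000">
        <core name="core1" instanceDir="core1" />
      </cores>
    </solr>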
On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:

We saw this error again today during our load test - basically, whenever
the session expires on the leader node, we see the error. After this
happens, the leader (001) goes into 'recovery' mode and all index updates
fail with a "503 - Service Unavailable" error message. After some time
(once recovery is successful), the roles are swapped, i.e. 001 acts as
the replica and 003 as the leader.

Btw, do you know why the connection to zookeeper [solr->zk] is getting
interrupted in the middle? Is it because of the load (number of updates)
we are putting on the cluster?

Here is the exception stack trace:

Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
        at java.lang.Thread.run(Unknown Source)

Thx, Sudhakar.
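On the question of why the solr->zk link gets interrupted: a frequent
culprit under heavy indexing is a garbage-collection pause on the Solr JVM
that outlasts the session timeout - a general observation, not a diagnosis
of this particular cluster. One way to check is to enable GC logging on
the Solr JVMs, however they are launched, and look for pauses longer than
zkClientTimeout:

    # illustrative HotSpot flags (JDK 6/7 era); add to the Solr JVM options
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/solr/gc.log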
On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:

Erick:
Not seeing any page caching related issues...

Mark:
1. Would this "waiting" on 003 (the replica) cause any inconsistencies in
the zookeeper cluster state? I was also looking at the leader (001) logs
at that time and saw errors related to "SEVERE: ClusterState says we are
the leader, but locally we don't think so".
2. Also, all of our servers in the cluster went down while the index
updates were running in parallel with this issue. Do you think that is
related to the session expiry on 001?

Here are the logs on 001
=========================

Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
INFO: According to ZK I (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a leader.

Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor run
WARNING: Overseer cannot talk to ZK

Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem making a request to the leader:org.apache.solr.common.SolrException: I was asked to wait on state down for <001>:8080_solr but I still do not see the request state. I see state: active live:true
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem making a request to the leader:org.apache.solr.common.SolrException: I was asked to wait on state down for <001>:8080_solr but I still do not see the request state. I see state: active live:true
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
....
....
Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
....
....
Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
SEVERE: :org.apache.solr.common.SolrException: There was a problem finding the leader in zk
        at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
        at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
SEVERE: Error getting leader from zk
org.apache.solr.common.SolrException: There is conflicting information about the leader of shard: shard1 our state says:http://<001>:8080/solr/core1/ but zookeeper says:http://<003>:8080/solr/core1/
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
....
....

Thanks for your inputs.
Sudhakar.

On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmil...@gmail.com> wrote:

Yes - it means that 001 went down (or, more likely, had its connection to
ZooKeeper interrupted; that's what I mean about a session timeout - if
the solr->zk link is broken for longer than the session timeout, that
will trigger a leader election, and when the connection is reestablished,
the node will have to recover). That waiting should stop as soon as 001
came back up or reconnected to ZooKeeper.

In fact, this waiting should not happen in this case - but only on
cluster restart. This is a bug that is fixed in 4.1 (hopefully coming
very soon!):

* SOLR-3940: Rejoining the leader election incorrectly triggers the code path
  for a fresh cluster start rather than fail over. (Mark Miller)

- Mark
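A side note on the timeout relationship Mark describes: the session
timeout a client requests is also bounded by the ZooKeeper servers
themselves, so raising zkClientTimeout only helps up to the server-side
ceiling. A minimal zoo.cfg sketch with illustrative values:

    # zoo.cfg - by default the server clamps requested session timeouts
    # into the range [2*tickTime, 20*tickTime]
    tickTime=2000
    # with tickTime=2000 the ceiling is 40000 ms, so a Solr
    # zkClientTimeout of 30000 fits; anything larger would be clamped
    initLimit=10
    syncLimit=5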
On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:

Yep, after restarting, the cluster came back to a normal state. We will
run a couple more tests and see if we can reproduce this issue.

Btw, I am attaching the server logs from before that 'INFO: Waiting until
we see more replicas' message. From the logs, we can see that the leader
election process started on 003, which was initially the replica for 001.
Does that mean leader 001 went down at that time?

logs on 003:
========
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: My last published State was Active, it's okay to be the leader.
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: I may be the new leader - try and sync
12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core core=core1.
12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
INFO: Sync replicas to http://<003>:8080/solr/core1/
12:11:16 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
INFO: Updating live nodes -> this message is on 002
12:11:46 PM org.apache.solr.update.PeerSync handleResponse
WARNING: PeerSync: core=core1 url=http://<003>:8080/solr exception talking to <001>:8080/solr/core1/, failed
org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: <001>:8080/solr/core1
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
        at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(Unknown Source)
12:11:46 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync failed
12:11:46 PM org.apache.solr.common.SolrException log
SEVERE: Sync Failed
12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
INFO: There is a better leader candidate than us - going back into recovery
12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
INFO: Running recovery - first canceling any ongoing recovery
12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
INFO: Starting recovery process. core=core1 recoveringAfterStartup=false
12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1 - recoveringAfterStartup=false
12:11:46 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179495
12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=178985
....
....

Thanks for your help.
Sudhakar.

<logs_error.txt>

On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <markrmil...@gmail.com> wrote:

The waiting logging had to happen on restart unless it's some kind of bug.

Beyond that, something is off, but I have no clue why - it seems your
clusterstate.json is not up to date at all.

Have you tried restarting the cluster then? Does that help at all?

Do you see any exceptions around zookeeper session timeouts?

- Mark

On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:

Hey Mark,

Yes, I am able to access all of the nodes under each shard from the
solrcloud admin UI.

- "It kind of looks like the urls solrcloud is using are not accessible.
  When you go to the admin page and the cloud tab, can you access the
  urls it shows for each shard? That is, if you click one of the links or
  copy and paste the address into a web browser, does it work?"

Actually, I got these errors while my document upload task/job was
running, not during a cluster restart. Also, the job ran fine for the
first hour and only started throwing these errors after indexing some
docs.

Thx, Sudhakar.

On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmil...@gmail.com> wrote:

It kind of looks like the urls solrcloud is using are not accessible.
When you go to the admin page and the cloud tab, can you access the urls
it shows for each shard? That is, if you click one of the links or copy
and paste the address into a web browser, does it work?

You may have to explicitly set the host= in solr.xml if it's not auto
detecting the right one. Make sure the ports look right too.

> waitForReplicasToComeUp
> INFO: Waiting until we see more replicas up: total=2 found=1
> timeoutin=179999

That happens when you stop the cluster and try to start it again - before
a leader is chosen, it will wait for all known replicas for a shard to
come up so that everyone can sync up and have a chance to be the best
leader. So at this point it was only finding one of 2 known replicas and
waiting for the second to come up. After a couple of minutes
(configurable) it will just continue anyway without the missing replica
(if it doesn't show up).

- Mark
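The "couple of minutes (configurable)" here corresponds, if memory of the
4.x solr.xml schema serves, to the leaderVoteWait attribute on the <cores>
element (in milliseconds; the timeoutin=179999 in the logs above is
consistent with a 180-second default). A sketch with placeholder values:

    <!-- a sketch, not this cluster's actual config -->
    <cores adminPath="/admin/cores" host="${host:}" hostPort="8080"
           zkClientTimeout="30000" leaderVoteWait="180000">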
On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:

Hi,
We are uploading solr documents to the index in batches using 30 threads,
with a ThreadPoolExecutor and a LinkedBlockingQueue whose max limit is
set to 10000. In the code, we are using HttpSolrServer and its
add(inputDoc) method to add docs. And we have the following commit
settings in solrconfig:

<autoCommit>
  <maxTime>300000</maxTime>
  <maxDocs>10000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>

Cluster Details:
----------------------------
solr version - 4.0
zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
numShards=2
001, 002, 003 are the solr nodes and these three are behind the
loadbalancer <vip>
001, 003 assigned to shard1; 002 assigned to shard2
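A minimal sketch of the indexing setup described above - the class name,
field names, document count, and load-balancer URL are illustrative, not
from the actual job:

    import java.util.concurrent.*;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // one client pointed at the load balancer VIP (hypothetical URL)
            final HttpSolrServer solr =
                    new HttpSolrServer("http://vip:8080/solr/core1");
            // 30 workers fed from a bounded queue of 10000, as in the thread;
            // CallerRunsPolicy makes the producer block when the queue is full
            ExecutorService pool = new ThreadPoolExecutor(
                    30, 30, 0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>(10000),
                    new ThreadPoolExecutor.CallerRunsPolicy());
            for (int i = 0; i < 100000; i++) {
                final int id = i;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", Integer.toString(id));
                            // no explicit commit: autoCommit/autoSoftCommit
                            // in solrconfig.xml handle visibility
                            solr.add(doc);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            solr.shutdown();
        }
    }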
Logs: getting the errors in the below sequence after uploading some docs:
-----------------------------------------------------------------------------------------------------------
003
Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999

001
Dec 4, 2012 12:12:59 PM org.apache.solr.update.processor.DistributedUpdateProcessor doDefensiveChecks
SEVERE: ClusterState says we are the leader, but locally we don't think so

003
Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...

001
Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at <vip>/solr/core1. returned non ok status:503, message:Service Unavailable
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)

001
Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
SEVERE: Error while trying to recover. core=core1:org.apache.solr.common.SolrException: We are not the leader
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)

001
Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at <vip>/solr/core1
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
        ... 5 lines omitted ...
        at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Connection reset

After some time, all three servers go down.

Appreciate it if someone could let us know what we are missing.

Thx, Sudhakar.