What led you to try that? I'm not connecting the dots in my head - the
exception and the solution.

- Mark

On Feb 3, 2013, at 2:48 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:

> Hi,
> 
> I think the issue was not the zk client timeout but the POST request size.
> Since I increased the value of Request.maxFormContentSize in jetty.xml, I
> don't see this issue any more.
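> 
> For reference, this is roughly the change I mean - a sketch against the
> Jetty 8 that ships with Solr 4.x (the 2000000-byte limit is just an example
> value; pick something larger than your biggest update request). It goes
> inside the <Configure id="Server"> element of etc/jetty.xml:
> 
>     <Call name="setAttribute">
>       <!-- max size (bytes) of a POST body Jetty will buffer as form content -->
>       <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
>       <Arg>2000000</Arg>
>     </Call>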
> 
> Regards.
> 
> On 3 February 2013 01:56, Mark Miller <markrmil...@gmail.com> wrote:
> 
>> Do you see anything about session expiration in the logs? That is the
>> likely culprit for something like this. You may need to raise the timeout:
>> http://wiki.apache.org/solr/SolrCloud#FAQ
>> 
>> If you see no session timeouts, I don't have a guess yet.
>> 
>> - Mark
>> 
>> On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:
>> 
>>> I'm experiencing the same problem in Solr 4.1 during bulk loading. After 50
>>> minutes of indexing, the following error starts to occur:
>>> 
>>> INFO: [core] webapp=/solr path=/update params={} {} 0 4
>>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
>>>       at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
>>>       at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
>>>       at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
>>>       at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>       at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
>>>       at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
>>>       at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
>>>       at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
>>>       at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>       at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>       at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
>>>       at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
>>>       at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>>>       at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>>>       at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>>>       at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>>>       at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>>>       at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
>>>       at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
>>>       at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>>>       at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
>>>       at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>>>       at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>>>       at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>>>       at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>>>       at org.eclipse.jetty.server.Server.handle(Server.java:365)
>>>       at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
>>>       at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>>>       at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
>>>       at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
>>>       at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
>>>       at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>>>       at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>>>       at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>>>       at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>>>       at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>>>       at java.lang.Thread.run(Unknown Source)
>>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
>>> Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=50699
>>> 
>>> Then the leader tries to sync with the replica, and after it finishes I can
>>> continue loading.
>>> None of the SolrCloud nodes was restarted during that time. I don't remember
>>> such behaviour in Solr 4.0. Could it be related to the number of fields
>>> indexed during loading? I have a collection with about 2400 fields. I can't
>>> reproduce the same issue for other collections with far fewer fields per
>>> record.
>>> Regards.
>>> 
>>> On 11 December 2012 19:50, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>> 
>>>> Just an update on this issue:
>>>> We tried increasing the zookeeper client timeout setting to 30000 ms in
>>>> solr.xml (I think the default is 15000 ms), and haven't seen any issues in
>>>> our tests since:
>>>> <cores .........           zkClientTimeout="30000" >
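>>>> 
>>>> For anyone else looking for the exact spot, a minimal legacy solr.xml
>>>> sketch - everything other than zkClientTimeout here is just an
>>>> illustrative placeholder:
>>>> 
>>>>   <solr persistent="true">
>>>>     <cores adminPath="/admin/cores" defaultCoreName="core1"
>>>>            zkClientTimeout="30000">
>>>>       <core name="core1" instanceDir="core1" />
>>>>     </cores>
>>>>   </solr>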
>>>> 
>>>> Thanks, Sudhakar.
>>>> 
>>>> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>> 
>>>>> We saw this error again today during our load test - basically, whenever
>>>>> the session gets expired on the leader node, we see the error. After this
>>>>> happens, the leader (001) goes into 'recovery' mode and all the index
>>>>> updates fail with a "503 - Service Unavailable" error message. After some
>>>>> time (once recovery is successful), the roles are swapped, i.e. 001 acts
>>>>> as the replica and 003 as the leader.
>>>>> 
>>>>> Btw, do you know why the connection to zookeeper [solr->zk] is getting
>>>>> interrupted in the middle?
>>>>> Is it because of the load (number of updates) we are putting on the cluster?
>>>>> 
>>>>> Here is the exception stack trace:
>>>>> 
>>>>> Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>>> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
>>>>>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>>>>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>>       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>>>       at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>>>>>       at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>>>>>       at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>>>>>       at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>>>>>       at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
>>>>>       at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
>>>>>       at java.lang.Thread.run(Unknown Source)
>>>>> 
>>>>> Thx, Sudhakar.
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>> 
>>>>>> Erick:
>>>>>> Not seeing any page caching related issues...
>>>>>> 
>>>>>> Mark:
>>>>>> 1. Would this "waiting" on 003 (replica) cause any inconsistencies in the
>>>>>> zookeeper cluster state? I was also looking at the leader (001) logs at
>>>>>> that time and saw errors related to "*SEVERE: ClusterState says we are the
>>>>>> leader, but locally we don't think so*".
>>>>>> 2. Also, all of our servers in the cluster went down while the index
>>>>>> updates were running in parallel with this issue. Do you think this is
>>>>>> related to the session expiry on 001?
>>>>>> Here are the logs on 001
>>>>>> =========================
>>>>>> 
>>>>>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>>>> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
>>>>>>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>>>>>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>>>       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>>>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>>>> INFO: According to ZK I (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a leader.
>>>>>> 
>>>>>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor run
>>>>>> WARNING: Overseer cannot talk to ZK
>>>>>> 
>>>>>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>>>       at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>>>       at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>>>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>>>       at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>>>       at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>>>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem making a request to the leader:org.apache.solr.common.SolrException: I was asked to wait on state down for <001>:8080_solr but I still do not see the request state. I see state: active live:true
>>>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>>>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem making a request to the leader:org.apache.solr.common.SolrException: I was asked to wait on state down for <001>:8080_solr but I still do not see the request state. I see state: active live:true
>>>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>>>> ....
>>>>>> ....
>>>>>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>>>       at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>>> ....
>>>>>> ....
>>>>>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: :org.apache.solr.common.SolrException: There was a problem finding the leader in zk
>>>>>>       at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
>>>>>>       at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>>>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
>>>>>> SEVERE: Error getting leader from zk
>>>>>> org.apache.solr.common.SolrException: There is conflicting information about the leader of shard: shard1 our state says:http://<001>:8080/solr/core1/ but zookeeper says:http://<003>:8080/solr/core1/
>>>>>>       at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
>>>>>>       at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
>>>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>> INFO: Running the leader process.
>>>>>> ....
>>>>>> ....
>>>>>> 
>>>>>> Thanks for your inputs.
>>>>>> Sudhakar.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>> 
>>>>>>> Yes - it means that 001 went down (or, more likely, had its connection to
>>>>>>> ZooKeeper interrupted - that's what I mean about a session timeout: if the
>>>>>>> solr->zk link is broken for longer than the session timeout, that will
>>>>>>> trigger a leader election, and when the connection is reestablished, the
>>>>>>> node will have to recover). That waiting should stop as soon as 001 came
>>>>>>> back up or reconnected to ZooKeeper.
>>>>>>> 
>>>>>>> In fact, this waiting should not happen in this case - but only on
>>>>>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming
>>>>>>> very soon!):
>>>>>>> 
>>>>>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the code
>>>>>>> path for a fresh cluster start rather than fail over. (Mark Miller)
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Yep, after restarting, the cluster came back to a normal state. We will
>>>>>>>> run a couple more tests and see if we can reproduce this issue.
>>>>>>>> 
>>>>>>>> Btw, I am attaching the server logs before that 'INFO: Waiting until we
>>>>>>>> see more replicas' message. From the logs, we can see that the leader
>>>>>>>> election process started on 003, which was the replica for 001 initially.
>>>>>>>> Does that mean leader 001 went down at that time?
>>>>>>>> 
>>>>>>>> logs on 003:
>>>>>>>> ========
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>       INFO: Running the leader process.
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>       INFO: Checking if I should try and be the leader.
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>>>       INFO: My last published State was Active, it's okay to be the leader.
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>       INFO: I may be the new leader - try and sync
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>>>>>>>>       WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core core=core1.
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>>>>>>>>       INFO: Sync replicas to http://<003>:8080/solr/core1/
>>>>>>>> 12:11:16 PM org.apache.solr.update.PeerSync sync
>>>>>>>>       INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>>>> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>>>>       INFO: Updating live nodes -> this message is on 002
>>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>>>>>>>>       WARNING: PeerSync: core=core1 url=http://<003>:8080/solr exception talking to <001>:8080/solr/core1/, failed
>>>>>>>>       org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: <001>:8080/solr/core1
>>>>>>>>             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>>>>>>>>             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>             at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>>>>>>>             at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>>>>>>>             at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>>>>>>>             at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>>>             at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>>>>>>>             at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>>>>>>>             at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>>>             at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>>>>>             at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>>>>>             at java.lang.Thread.run(Unknown Source)
>>>>>>>>       Caused by: java.net.SocketTimeoutException: Read timed out
>>>>>>>>             at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>>>>             at java.net.SocketInputStream.read(Unknown Source)
>>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>>>>       INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync failed
>>>>>>>> 12:11:46 PM org.apache.solr.common.SolrException log
>>>>>>>>       SEVERE: Sync Failed
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
>>>>>>>>       INFO: There is a better leader candidate than us - going back into recovery
>>>>>>>> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>>>>>>>>       INFO: Running recovery - first canceling any ongoing recovery
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>>>>>>>>       INFO: Starting recovery process.  core=core1 recoveringAfterStartup=false
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>>>>>>>       INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1 - recoveringAfterStartup=false
>>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>>>>       INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>>>       INFO: Running the leader process.
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>>>>       INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
>>>>>>>> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>>>>       INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179495
>>>>>>>> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>>>>       INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=178985
>>>>>>>> ....
>>>>>>>> ....
>>>>>>>> 
>>>>>>>> Thanks for your help.
>>>>>>>> Sudhakar.
>>>>>>>> 
>>>>>>>> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> The waiting logging had to happen on restart unless it's some kind of bug.
>>>>>>>> 
>>>>>>>> Beyond that, something is off, but I have no clue why - it seems your
>>>>>>>> clusterstate.json is not up to date at all.
>>>>>>>> 
>>>>>>>> Have you tried restarting the cluster then? Does that help at all?
>>>>>>>> 
>>>>>>>> Do you see any exceptions around zookeeper session timeouts?
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hey Mark,
>>>>>>>>> 
>>>>>>>>> Yes, I am able to access all of the nodes under each shard from the
>>>>>>>>> solrcloud admin UI.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> - *It kind of looks like the urls solrcloud is using are not accessible.
>>>>>>>>> When you go to the admin page and the cloud tab, can you access the urls
>>>>>>>>> it shows for each shard? That is, if you click one of the links or copy
>>>>>>>>> and paste the address into a web browser, does it work?*
>>>>>>>>> 
>>>>>>>>> Actually, I got these errors when my document upload task/job was
>>>>>>>>> running, not during the cluster restart. Also, the job ran fine for the
>>>>>>>>> first hour and started throwing these errors after indexing some docs.
>>>>>>>>> 
>>>>>>>>> Thx, Sudhakar.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> It kind of looks like the urls solrcloud is using are not accessible.
>>>>>>>>>> When you go to the admin page and the cloud tab, can you access the urls
>>>>>>>>>> it shows for each shard? That is, if you click one of the links or copy
>>>>>>>>>> and paste the address into a web browser, does it work?
>>>>>>>>>> 
>>>>>>>>>> You may have to explicitly set the host= in solr.xml if it's not
>>>>>>>>>> auto-detecting the right one. Make sure the ports look right too.
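>>>>>>>>>> 
>>>>>>>>>> For example, something like this on the <cores> element in solr.xml - a
>>>>>>>>>> sketch where the hostname and port are placeholders for your own values:
>>>>>>>>>> 
>>>>>>>>>>   <cores adminPath="/admin/cores"
>>>>>>>>>>          host="solr001.example.com" hostPort="8080" hostContext="solr"
>>>>>>>>>>          zkClientTimeout="15000">
>>>>>>>>>>     ...
>>>>>>>>>>   </cores>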
>>>>>>>>>> 
>>>>>>>>>>> waitForReplicasToComeUp
>>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
>>>>>>>>>> 
>>>>>>>>>> That happens when you stop the cluster and try to start it again -
>>>>>>>>>> before a leader is chosen, it will wait for all known replicas for a
>>>>>>>>>> shard to come up so that everyone can sync up and have a chance to be
>>>>>>>>>> the best leader. So at this point it was only finding one of 2 known
>>>>>>>>>> replicas and waiting for the second to come up. After a couple of
>>>>>>>>>> minutes (configurable) it will just continue anyway without the missing
>>>>>>>>>> replica (if it doesn't show up).
>>>>>>>>>> 
>>>>>>>>>> - Mark
>>>>>>>>>> 
>>>>>>>>>> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <maddineni...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> We are uploading solr documents to the index in batches using 30
>>>>>>>>>>> threads, with a ThreadPoolExecutor and a LinkedBlockingQueue whose max
>>>>>>>>>>> limit is set to 10000.
>>>>>>>>>>> In the code, we are using HttpSolrServer and the add(inputDoc) method
>>>>>>>>>>> to add docs.
>>>>>>>>>>> And we have the following commit settings in solrconfig:
>>>>>>>>>>> 
>>>>>>>>>>>   <autoCommit>
>>>>>>>>>>>     <maxTime>300000</maxTime>
>>>>>>>>>>>     <maxDocs>10000</maxDocs>
>>>>>>>>>>>     <openSearcher>false</openSearcher>
>>>>>>>>>>>   </autoCommit>
>>>>>>>>>>> 
>>>>>>>>>>>   <autoSoftCommit>
>>>>>>>>>>>     <maxTime>1000</maxTime>
>>>>>>>>>>>   </autoSoftCommit>
>>>>>>>>>>> 
>>>>>>>>>>> Cluster Details:
>>>>>>>>>>> ----------------------------
>>>>>>>>>>> solr version - 4.0
>>>>>>>>>>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>>>>>>>>>>> numShards=2
>>>>>>>>>>> 001, 002, 003 are the solr nodes and these three are behind the
>>>>>>>>>>> load balancer <vip>
>>>>>>>>>>> 001, 003 assigned to shard1; 002 assigned to shard2
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Logs: getting the errors in the below sequence after uploading some docs:
>>>>>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>>>>>> 003
>>>>>>>>>>> Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
>>>>>>>>>>> 
>>>>>>>>>>> 001
>>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.update.processor.DistributedUpdateProcessor doDefensiveChecks
>>>>>>>>>>> SEVERE: ClusterState says we are the leader, but locally we don't think so
>>>>>>>>>>> 
>>>>>>>>>>> 003
>>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...
>>>>>>>>>>> 
>>>>>>>>>>> 001
>>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at <vip>/solr/core1. returned non ok status:503, message:Service Unavailable
>>>>>>>>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>>>>>>>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>> 
>>>>>>>>>>> 001
>>>>>>>>>>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: Error while trying to recover. core=core1:org.apache.solr.common.SolrException: We are not the leader
>>>>>>>>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>>>>>>>>> 
>>>>>>>>>>> 001
>>>>>>>>>>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at <vip>/solr/core1
>>>>>>>>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>>>>>>>>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>>       at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>>>>>>>>>       at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>>>>>>>>>>> ... 5 lines omitted ...
>>>>>>>>>>>       at java.lang.Thread.run(Unknown Source)
>>>>>>>>>>> Caused by: java.net.SocketException: Connection reset
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> After some time, all three servers go down.
>>>>>>>>>>> 
>>>>>>>>>>> We'd appreciate it if someone could let us know what we are missing.
>>>>>>>>>>> 
>>>>>>>>>>> Thx, Sudhakar.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> <logs_error.txt>
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> 
