I brought the bad one down and back up and it did nothing. I can clear the
index and try 4.2.1. I will save off the logs and see if there is anything
else odd.
On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:
> It would appear it's a bug given what you have said.
>
> Any other exceptions would be useful. Might be best to start tracking in a
> JIRA issue as well.
>
> To fix, I'd bring the behind node down and back again.
>
> Unfortunately, I'm pressed for time, but we really need to get to the
> bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
> to mirrors now).
>
> - Mark
>
> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>
> > Sorry I didn't ask the obvious question. Is there anything else that I
> > should be looking for here, and is this a bug? I'd be happy to troll
> > through the logs further if more information is needed, just let me know.
> >
> > Also, what is the most appropriate mechanism to fix this? Is it required
> > to kill the index that is out of sync and let Solr resync things?
> >
> >
> > On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >
> >> sorry for spamming here....
> >>
> >> shard5-core2 is the instance we're having issues with...
> >>
> >> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >> SEVERE: shard update error StdNode:
> >> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
> >> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
> >> status:503, message:Service Unavailable
> >> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >> at java.lang.Thread.run(Thread.java:662)
> >>
> >>
> >> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>
> >>> here is another one that looks interesting:
> >>>
> >>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
> >>> the leader, but locally we don't think so
> >>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>> at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>
> >>>
> >>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>
> >>>> Looking at the master, it looks like at some point there were shards
> >>>> that went down. I am seeing things like what is below.
> >>>>
> >>>> INFO: A cluster state change: WatchedEvent state:SyncConnected
> >>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating...
> >>>> (live nodes size: 12)
> >>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>>> INFO: Updating live nodes... (9)
> >>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>> INFO: Running the leader process.
> >>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>> INFO: Checking if I should try and be the leader.
> >>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>> INFO: My last published State was Active, it's okay to be the leader.
> >>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>> INFO: I may be the new leader - try and sync
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>
> >>>>> I don't think the versions you are thinking of apply here. PeerSync
> >>>>> does not look at that - it looks at version numbers for updates in the
> >>>>> transaction log - it compares the last 100 of them on leader and replica.
> >>>>> What it's saying is that the replica seems to have versions that the
> >>>>> leader does not. Have you scanned the logs for any interesting exceptions?
> >>>>>
> >>>>> Did the leader change during the heavy indexing? Did any zk session
> >>>>> timeouts occur?
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>>>>
> >>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
> >>>>>> strange issue while testing today. Specifically, the replica has a
> >>>>>> higher version than the master, which is causing the index to not
> >>>>>> replicate. Because of this the replica has fewer documents than the
> >>>>>> master. What could cause this, and how can I resolve it short of
> >>>>>> taking down the index and scp'ing the right version in?
> >>>>>>
> >>>>>> MASTER:
> >>>>>> Last Modified: about an hour ago
> >>>>>> Num Docs: 164880
> >>>>>> Max Doc: 164880
> >>>>>> Deleted Docs: 0
> >>>>>> Version: 2387
> >>>>>> Segment Count: 23
> >>>>>>
> >>>>>> REPLICA:
> >>>>>> Last Modified: about an hour ago
> >>>>>> Num Docs: 164773
> >>>>>> Max Doc: 164773
> >>>>>> Deleted Docs: 0
> >>>>>> Version: 3001
> >>>>>> Segment Count: 30
> >>>>>>
> >>>>>> in the replica's log it says this:
> >>>>>>
> >>>>>> INFO: Creating new http client,
> >>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>
> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> >>>>>> START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >>>>>>
> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> >>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>
> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> >>>>>> Our versions are newer. ourLowThreshold=1431233788792274944
> >>>>>> otherHigh=1431233789440294912
> >>>>>>
> >>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
> >>>>>> DONE. sync succeeded
> >>>>>>
> >>>>>> which again seems to indicate that it thinks it has a newer version of
> >>>>>> the index, so it aborts. This happened while having 10 threads indexing
> >>>>>> 10,000 items, writing to a 6-shard (1 replica each) cluster. Any
> >>>>>> thoughts on this or what I should look for would be appreciated.
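For anyone following along: a quick way to confirm the master/replica
divergence shown above (Num Docs 164880 vs 164773) is to query each core
directly with distrib=false, so each core answers only for itself instead of
fanning the query out across the cluster. A minimal SolrJ sketch, using
current client names (HttpSolrClient; in 4.x it was HttpSolrServer, as in
the stack traces above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CompareCoreCounts {
    public static void main(String[] args) throws Exception {
        String[] cores = {
            "http://10.38.33.16:7575/solr/dsc-shard5-core1", // leader in the logs above
            "http://10.38.33.17:7577/solr/dsc-shard5-core2"  // the lagging replica
        };
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);              // only the count is needed
        q.set("distrib", "false"); // answer from this core only, no fan-out
        for (String url : cores) {
            try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
                long numFound = client.query(q).getResults().getNumFound();
                System.out.println(url + " -> numFound=" + numFound);
            }
        }
    }
}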
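Mark's description of PeerSync above is the key to the misleading "sync
succeeded" log: each side reports its most recent ~100 update versions from
the transaction log, and if the other side's whole window looks older than
ours, the replica concludes its own versions are newer and reports success
without fetching anything. A simplified, self-contained model of that
comparison (an illustration of the idea only, not Solr's actual PeerSync
source, which uses thresholds within the windows rather than the strict
endpoints used here):

import java.util.Arrays;
import java.util.List;

public class PeerSyncSketch {

    // Both lists are recent update versions from each core's transaction
    // log, sorted newest-first, as PeerSync exchanges them.
    static boolean ourVersionsLookNewer(List<Long> ours, List<Long> theirs) {
        long ourLowThreshold = ours.get(ours.size() - 1); // oldest of our recent window
        long otherHigh = theirs.get(0);                   // newest the other side reports
        // If even the newest version the other side has falls below our
        // window, replaying its updates would only move us backwards, so
        // the sync is reported as succeeded without fetching anything.
        return otherHigh < ourLowThreshold;
    }

    public static void main(String[] args) {
        List<Long> replica = Arrays.asList(220L, 215L, 210L); // illustrative values
        List<Long> leader  = Arrays.asList(205L, 200L, 195L);
        System.out.println("replica considers itself newer: "
                + ourVersionsLookNewer(replica, leader)); // prints true
    }
}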
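The "ClusterState says we are the leader, but locally we don't think so"
exception suggests ZooKeeper's view of leadership and the core's local view
disagreed at some point, which fits the leader-election churn in the 5:41 PM
logs. One way to see which replica ZooKeeper currently records as leader is
the SolrJ cloud client; this sketch uses current API names (the 4.x
equivalents differ), and the ZK address, collection name, and shard name are
placeholders inferred from the core names in the thread:

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.Replica;

public class WhoIsLeader {
    public static void main(String[] args) throws Exception {
        // ZK address and collection/shard names are assumptions, not from the thread.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("10.38.33.16:2181"), Optional.empty()).build()) {
            client.connect(); // pulls the current cluster state from ZooKeeper
            Replica leader = client.getZkStateReader()
                                   .getClusterState()
                                   .getCollection("dsc")
                                   .getLeader("shard5");
            System.out.println("ZK-recorded leader for shard5: "
                    + (leader == null ? "none elected" : leader.getCoreUrl()));
        }
    }
}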
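And for the "clear the index" fallback mentioned at the top of the thread:
with the node stopped, removing the core's index and transaction log means
it has nothing to peer-sync from on restart, so it falls back to a full
replication from the shard leader. A minimal sketch in plain Java, assuming
the default data/index and data/tlog layout; the data directory path is
hypothetical and must be adjusted to the real instanceDir:

import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

public class ClearReplicaIndex {
    public static void main(String[] args) throws IOException {
        // Hypothetical core data directory; stop the node before running this.
        Path dataDir = Paths.get("/opt/solr/dsc-shard5-core2/data");
        for (String sub : new String[] {"index", "tlog"}) {
            Path dir = dataDir.resolve(sub);
            if (!Files.exists(dir)) continue;
            try (Stream<Path> walk = Files.walk(dir)) {
                // Delete files before the directories that contain them.
                walk.sorted(Comparator.reverseOrder())
                    .forEach(p -> p.toFile().delete());
            }
        }
        // On restart the core cannot peer-sync (no tlog) and does a full
        // replication from the current shard leader.
    }
}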