No, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each); it shows up when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior.
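For reference, a minimal sketch of the kind of multi-threaded SolrJ indexing run described above, assuming a 4.x SolrJ client; the URL, field names, and document contents are placeholders, not the actual test code:

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; in the test this would point at the cluster being indexed.
        final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int t = 0; t < 10; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Each thread adds 10,000 documents with a random unique key.
                        for (int i = 0; i < 10000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("key", UUID.randomUUID().toString()); // uniqueKey field (assumed name)
                            doc.addField("text", "test document");
                            server.add(doc);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
        server.shutdown();
    }
}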
On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Since I don't have that many items in my index, I exported all of the keys for each shard and wrote a simple Java program that checks for duplicates. I found some duplicate keys on different shards, and a grep of the files for those keys does indicate that they made it to the wrong places. Notice that documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the "live" nodes? I know that we don't specify the numShards param at startup, so could this be what is happening?

grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
shard1-core1:0
shard1-core2:0
shard2-core1:0
shard2-core2:0
shard3-core1:1
shard3-core2:1
shard4-core1:0
shard4-core2:0
shard5-core1:1
shard5-core2:1
shard6-core1:0
shard6-core2:0

On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:

Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I had messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with a count of more than 1.

On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:

Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0, and then I'll try on 4.2.1.

On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:

No, not that I know of, which is why I say we need to get to the bottom of it.

- Mark

On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Mark,
Is there a particular JIRA issue that you think may address this? I read through it quickly but didn't see one that jumped out.

On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:

I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.

On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:

It would appear it's a bug given what you have said.

Any other exceptions would be useful. Might be best to start tracking this in a JIRA issue as well.

To fix, I'd bring the behind node down and back up again.

Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).

- Mark

On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know.

Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things?
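For what it's worth, the duplicate check described in the 12:37 PM message above can be as simple as something like this; the exported key files (one per shard core, one key per line) are passed as arguments, and the file naming is just an assumption:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class DuplicateKeyCheck {
    public static void main(String[] args) throws Exception {
        // Maps each key to the first shard file it was seen in.
        Map<String, String> seen = new HashMap<String, String>();
        for (String fileName : args) { // e.g. shard1-core1.keys shard2-core1.keys ...
            BufferedReader in = new BufferedReader(new FileReader(fileName));
            String key;
            while ((key = in.readLine()) != null) {
                key = key.trim();
                if (key.isEmpty()) continue;
                String firstShard = seen.get(key);
                if (firstShard != null && !firstShard.equals(fileName)) {
                    // Same key exported from two different shards.
                    System.out.println("DUPLICATE: " + key + " in " + firstShard + " and " + fileName);
                } else {
                    seen.put(key, fileName);
                }
            }
            in.close();
        }
    }
}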
On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:

sorry for spamming here....

shard5-core2 is the instance we're having issues with...

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:

here is another one that looks interesting

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)

On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Looking at the master, it looks like at some point there were shards that went down. I am seeing things like what is below.

INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
INFO: Updating live nodes... (9)
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: My last published State was Active, it's okay to be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: I may be the new leader - try and sync

On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:

I don't think the versions you are thinking of apply here. PeerSync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on leader and replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?

Did the leader change during the heavy indexing? Did any zk session timeouts occur?

- Mark

On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:

I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today. Specifically, the replica has a higher version than the master, which is causing the index to not replicate. Because of this the replica has fewer documents than the master. What could cause this, and how can I resolve it short of taking down the index and scping the right version in?
MASTER:
Last Modified: about an hour ago
Num Docs: 164880
Max Doc: 164880
Deleted Docs: 0
Version: 2387
Segment Count: 23

REPLICA:
Last Modified: about an hour ago
Num Docs: 164773
Max Doc: 164773
Deleted Docs: 0
Version: 3001
Segment Count: 30

In the replica's log it says this:

INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded

which again seems to point to the replica thinking it has a newer version of the index, so it aborts. This happened while having 10 threads each indexing 10,000 items, writing to a 6 shard (1 replica each) cluster. Any thoughts on this or what I should look for would be appreciated.
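A rough illustration, not the actual Solr source, of the comparison behind the PeerSync log lines above: each side reports the versions of its most recent transaction-log updates, and the replica skips fetching anything when its own recent versions look newer than what the leader reports (the "Our versions are newer ... sync succeeded" case). The variable names follow the log output; the exact threshold logic here is an assumption for illustration:

import java.util.Collections;
import java.util.List;

public class PeerSyncSketch {
    // Returns true when "our" recent updates appear newer than the other
    // replica's, i.e. the situation logged as "Our versions are newer",
    // in which case no documents are pulled from the other replica.
    static boolean ourVersionsAreNewer(List<Long> ourRecentVersions, List<Long> otherRecentVersions) {
        long ourLowThreshold = Collections.min(ourRecentVersions); // oldest of our recent updates
        long otherHigh = Collections.max(otherRecentVersions);     // newest version the other side reports
        return ourLowThreshold > otherHigh;
    }
}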