No, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each); it shows up when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior.
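For reference, a minimal sketch of the kind of multi-threaded SolrJ indexing run described above, assuming a 4.x SolrJ client; the URL, field names, and document contents are placeholders, not the actual test code:

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; in the test this would point at the cluster being indexed.
        final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int t = 0; t < 10; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Each thread adds 10,000 documents with a random unique key.
                        for (int i = 0; i < 10000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("key", UUID.randomUUID().toString()); // uniqueKey field (assumed name)
                            doc.addField("text", "test document");
                            server.add(doc);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
        server.shutdown();
    }
}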
On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Since I don't have that many items in my index, I exported all of the keys for each shard and wrote a simple Java program that checks for duplicates. I found some duplicate keys on different shards, and a grep of the files for those keys does indicate that they made it to the wrong places. Notice that documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the "live" nodes? I know that we don't specify the numShards param at startup, so could this be what is happening?

grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
shard1-core1:0
shard1-core2:0
shard2-core1:0
shard2-core2:0
shard3-core1:1
shard3-core2:1
shard4-core1:0
shard4-core2:0
shard5-core1:1
shard5-core2:1
shard6-core1:0
shard6-core2:0

On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2...@gmail.com> wrote:

Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I had messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with a count of more than 1.

On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2...@gmail.com> wrote:

Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0, and then I'll try on 4.2.1.

On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmil...@gmail.com> wrote:

No, not that I know of, which is why I say we need to get to the bottom of it.

- Mark

On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Mark,
Is there a particular JIRA issue that you think may address this? I read through it quickly but didn't see one that jumped out.

On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:

I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.

On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmil...@gmail.com> wrote:

It would appear it's a bug given what you have said.

Any other exceptions would be useful. Might be best to start tracking this in a JIRA issue as well.

To fix, I'd bring the behind node down and back up again.

Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now).

- Mark

On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know.

Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things?
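For what it's worth, the duplicate check described in the 12:37 PM message above can be as simple as something like this; the exported key files (one per shard core, one key per line) are passed as arguments, and the file naming is just an assumption:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class DuplicateKeyCheck {
    public static void main(String[] args) throws Exception {
        // Maps each key to the first shard file it was seen in.
        Map<String, String> seen = new HashMap<String, String>();
        for (String fileName : args) { // e.g. shard1-core1.keys shard2-core1.keys ...
            BufferedReader in = new BufferedReader(new FileReader(fileName));
            String key;
            while ((key = in.readLine()) != null) {
                key = key.trim();
                if (key.isEmpty()) continue;
                String firstShard = seen.get(key);
                if (firstShard != null && !firstShard.equals(fileName)) {
                    // Same key exported from two different shards.
                    System.out.println("DUPLICATE: " + key + " in " + firstShard + " and " + fileName);
                } else {
                    seen.put(key, fileName);
                }
            }
            in.close();
        }
    }
}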
On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2...@gmail.com> wrote:

sorry for spamming here....

shard5-core2 is the instance we're having issues with...

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:

here is another one that looks interesting

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)

On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2...@gmail.com> wrote:

Looking at the master, it looks like at some point there were shards that went down. I am seeing things like what is below.

INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
INFO: Updating live nodes... (9)
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: My last published State was Active, it's okay to be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: I may be the new leader - try and sync

On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmil...@gmail.com> wrote:

I don't think the versions you are thinking of apply here. PeerSync does not look at that - it looks at version numbers for updates in the transaction log - it compares the last 100 of them on leader and replica. What it's saying is that the replica seems to have versions that the leader does not. Have you scanned the logs for any interesting exceptions?

Did the leader change during the heavy indexing? Did any zk session timeouts occur?

- Mark

On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2...@gmail.com> wrote:

I am currently looking at moving our Solr cluster to 4.2 and noticed a strange issue while testing today. Specifically, the replica has a higher version than the master, which is causing the index to not replicate. Because of this the replica has fewer documents than the master. What could cause this, and how can I resolve it short of taking down the index and scping the right version in?
MASTER:
Last Modified: about an hour ago
Num Docs: 164880
Max Doc: 164880
Deleted Docs: 0
Version: 2387
Segment Count: 23

REPLICA:
Last Modified: about an hour ago
Num Docs: 164773
Max Doc: 164773
Deleted Docs: 0
Version: 3001
Segment Count: 30

In the replica's log it says this:

INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912

Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded

which again seems to point to the replica thinking it has a newer version of the index, so it aborts. This happened while having 10 threads each indexing 10,000 items, writing to a 6 shard (1 replica each) cluster. Any thoughts on this or what I should look for would be appreciated.
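A rough illustration, not the actual Solr source, of the comparison behind the PeerSync log lines above: each side reports the versions of its most recent transaction-log updates, and the replica skips fetching anything when its own recent versions look newer than what the leader reports (the "Our versions are newer ... sync succeeded" case). The variable names follow the log output; the exact threshold logic here is an assumption for illustration:

import java.util.Collections;
import java.util.List;

public class PeerSyncSketch {
    // Returns true when "our" recent updates appear newer than the other
    // replica's, i.e. the situation logged as "Our versions are newer",
    // in which case no documents are pulled from the other replica.
    static boolean ourVersionsAreNewer(List<Long> ourRecentVersions, List<Long> otherRecentVersions) {
        long ourLowThreshold = Collections.min(ourRecentVersions); // oldest of our recent updates
        long otherHigh = Collections.max(otherRecentVersions);     // newest version the other side reports
        return ourLowThreshold > otherHigh;
    }
}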