[
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timothy Potter updated SOLR-4260:
---------------------------------
Attachment: demo_shard1_replicas_out_of_sync.tgz
While doing some other testing of SolrCloud (branch4x - 4.7-SNAPSHOT rev.
1556055), I hit this issue and here's the kicker ... there were no errors in my
replica's log, the tlogs are identical, and there was no significant GC
activity during the time where the replica got out of sync with the leader. I'm
attaching the data directories (index + tlog) for both replicas
(demo_shard1_replica1 [leader], and demo_shard1_replica2) and their log files.
When I do a doc-by-doc comparison of the two indexes, here's the result:
>> finished querying replica1, found 33537 documents (33537)
>> finished querying replica2, found 33528 documents
Doc [82995] not found in replica2: <doc boost="1.0"><field
name="id">82995</field><field name="string_s">test</field><field
name="int_i">-274468088</field><field name="float_f">0.90338105</field><field
name="double_d">0.6949391474539932</field><field name="text_en">this is a
test</field><field name="_version_">1456683668206518274</field></doc>
Doc [82997] not found in replica2: <doc boost="1.0"><field
name="id">82997</field><field name="string_s">test</field><field
name="int_i">301737117</field><field name="float_f">0.6746266</field><field
name="double_d">0.26034065188918565</field><field name="text_en">this is a
test</field><field name="_version_">1456683668206518276</field></doc>
Doc [82996] not found in replica2: <doc boost="1.0"><field
name="id">82996</field><field name="string_s">test</field><field
name="int_i">-1768315588</field><field name="float_f">0.6641093</field><field
name="double_d">0.23708033183534993</field><field name="text_en">this is a
test</field><field name="_version_">1456683668206518275</field></doc>
Doc [82991] not found in replica2: <doc boost="1.0"><field
name="id">82991</field><field name="string_s">test</field><field
name="int_i">-2057280061</field><field name="float_f">0.27617514</field><field
name="double_d">0.7885214691953506</field><field name="text_en">this is a
test</field><field name="_version_">1456683668206518273</field></doc>
Doc [82987] not found in replica2: <doc boost="1.0"><field
name="id">82987</field><field name="string_s">test</field><field
name="int_i">1051456320</field><field name="float_f">0.51863414</field><field
name="double_d">0.7881255443862878</field><field name="text_en">this is a
test</field><field name="_version_">1456683668206518272</field></doc>
Doc [82986] not found in replica2: <doc boost="1.0"><field
name="id">82986</field><field name="string_s">test</field><field
name="int_i">-1356807889</field><field name="float_f">0.2762279</field><field
name="double_d">0.003657816979820372</field><field name="text_en">this is a
test</field><field name="_version_">1456683668205469699</field></doc>
Doc [82984] not found in replica2: <doc boost="1.0"><field
name="id">82984</field><field name="string_s">test</field><field
name="int_i">732678870</field><field name="float_f">0.31199205</field><field
name="double_d">0.9848865821766198</field><field name="text_en">this is a
test</field><field name="_version_">1456683668205469698</field></doc>
Doc [82970] not found in replica2: <doc boost="1.0"><field
name="id">82970</field><field name="string_s">test</field><field
name="int_i">283693979</field><field name="float_f">0.6119651</field><field
name="double_d">0.04142006867388914</field><field name="text_en">this is a
test</field><field name="_version_">1456683668205469696</field></doc>
Doc [82973] not found in replica2: <doc boost="1.0"><field
name="id">82973</field><field name="string_s">test</field><field
name="int_i">1343103920</field><field name="float_f">0.5855809</field><field
name="double_d">0.6575904716584224</field><field name="text_en">this is a
test</field><field name="_version_">1456683668205469697</field></doc>
No amount of committing or reloading of these cores helps. Also, restarting
replica2 doesn't lead to it being in-sync either, most likely because the tlog
is identical to the leader? Here's the log messages on replica2 after
restarting it.
2014-01-08 13:28:20,112 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore
- [demo_shard1_replica2] Registered new searcher Searcher@4345de8a
main{StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356
_f(4.7):C3381)}
2014-01-08 13:28:21,298 [RecoveryThread] INFO solr.cloud.RecoveryStrategy -
Attempting to PeerSync from
http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/
core=demo_shard1_replica2 - recoveringAfterStartup=true
2014-01-08 13:28:21,302 [RecoveryThread] INFO solr.update.PeerSync -
PeerSync: core=demo_shard1_replica2
url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr START
replicas=[http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/]
nUpdates=100
2014-01-08 13:28:21,330 [RecoveryThread] INFO solr.update.PeerSync -
PeerSync: core=demo_shard1_replica2
url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr Received 99
versions from
ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/
2014-01-08 13:28:21,331 [RecoveryThread] INFO solr.update.PeerSync -
PeerSync: core=demo_shard1_replica2
url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr Our versions
are newer. ourLowThreshold=1456683689417113603 otherHigh=1456683689602711553
2014-01-08 13:28:21,331 [RecoveryThread] INFO solr.update.PeerSync -
PeerSync: core=demo_shard1_replica2
url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr DONE. sync
succeeded
2014-01-08 13:28:21,332 [RecoveryThread] INFO solr.update.UpdateHandler -
start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2014-01-08 13:28:21,332 [RecoveryThread] INFO solr.update.LoggingInfoStream -
[DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false
pendingChangesInFullFlush: false
2014-01-08 13:28:21,333 [RecoveryThread] INFO solr.update.UpdateHandler - No
uncommitted changes. Skipping IW.commit.
2014-01-08 13:28:21,334 [RecoveryThread] INFO solr.update.LoggingInfoStream -
[DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false
pendingChangesInFullFlush: false
2014-01-08 13:28:21,334 [RecoveryThread] INFO solr.update.LoggingInfoStream -
[IW][RecoveryThread]: nrtIsCurrent: infoVersion matches: true; DW changes:
false; BD changes: false
2014-01-08 13:28:21,335 [RecoveryThread] INFO solr.update.LoggingInfoStream -
[DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false
pendingChangesInFullFlush: false
2014-01-08 13:28:21,335 [RecoveryThread] INFO solr.search.SolrIndexSearcher -
Opening Searcher@5fc2a9d main
2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore
- QuerySenderListener sending requests to Searcher@5fc2a9d
main{StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356
_f(4.7):C3381)}
2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore
- QuerySenderListener done.
2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore
- [demo_shard1_replica2] Registered new searcher Searcher@5fc2a9d
main{StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356
_f(4.7):C3381)}
2014-01-08 13:28:21,339 [RecoveryThread] INFO solr.update.UpdateHandler -
end_commit_flush
2014-01-08 13:28:21,339 [RecoveryThread] INFO solr.cloud.RecoveryStrategy -
PeerSync Recovery was successful - registering as Active.
core=demo_shard1_replica2
2014-01-08 13:28:21,339 [RecoveryThread] INFO solr.cloud.ZkController -
publishing core=demo_shard1_replica2 state=active
2014-01-08 13:28:21,370 [main-EventThread] INFO common.cloud.ZkStateReader -
A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 6)
Thus, it would seem there might be some code that's outright losing documents
(almost feels like a last batch not flushed error but more subtle as it's not
easy to reproduce this all the time)
> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png,
> 192.168.20.104-replica2.png, clusterstate.png,
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using
> CloudSolrServer we see inconsistencies between the leader and replica for
> some shards.
> Each core hold about 3.3k documents. For some reason 5 out of 10 shards have
> a small deviation in then number of documents. The leader and slave deviate
> for roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my
> attention, there were small IDF differences for exactly the same record
> causing a record to shift positions in the result set. During those tests no
> records were indexed. Consecutive catch all queries also return different
> number of numDocs.
> We're running a 10 node test cluster with 10 shards and a replication factor
> of two and frequently reindex using a fresh build from trunk. I've not seen
> this issue for quite some time until a few days ago.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]