Sounds like you need to raise your ZooKeeper connection timeout.
Also, make sure you are using a concurrent garbage collector as a side
note - stop the world pauses should be avoided. Just good advice :)
- Mark
On Dec 18, 2013, at 5:48 AM, Anca Kopetz<anca.kop...@kelkoo.com> wrote:
Hi,
In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from
time to time into recovering state, and it takes more than 10 minutes to
finish to recover.
In logs, we see that "PeerSync Recovery" fails with the message :
PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many
updates received since start - startingUpdates no longer overlaps with our
currentUpdates
Then "Replication Recovery" starts.
Is there something we can do to avoid the failure of "Peer Recovery" so
that the recovery process is more rapid (less than 10 minutes) ?
The full trace log is here :
2013-12-05 13:51:53,740 [http-8080-46] INFO
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
- It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO
org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
- It has been requested that we recover
2013-12-05 13:51:53,740 [http-8080-112] INFO
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
QTime=0
2013-12-05 13:51:53,740 [Thread-1544] INFO
org.apache.solr.cloud.ZkController:publish:1017 - publishing
core=fr_green state=recovering
2013-12-05 13:51:53,741 [http-8080-46] INFO
org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 -
[admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
QTime=1
2013-12-05 13:51:53,740 [Thread-1543] INFO
org.apache.solr.cloud.ZkController:publish:1017 - publishing
core=fr_green state=recovering
2013-12-05 13:51:53,743 [Thread-1544] INFO
org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on
descriptor - reading it from system property
2013-12-05 13:51:53,746 [Thread-1543] INFO
org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on
descriptor - reading it from system property
2013-12-05 13:51:53,755 [Thread-1543] WARN
org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for
zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
2013-12-05 13:51:53,756 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery
process. core=fr_green recoveringAfterStartup=false
2013-12-05 13:51:53,762 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery
process. core=fr_green
2013-12-05 13:51:53,762 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery
process. core=fr_green recoveringAfterStartup=false
2013-12-05 13:51:53,765 [RecoveryThread] INFO
org.apache.solr.cloud.ZkController:publish:1017 - publishing
core=fr_green state=recovering
2013-12-05 13:51:53,765 [RecoveryThread] INFO
org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on
descriptor - reading it from system property
2013-12-05 13:51:53,767 [RecoveryThread] INFO
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 -
Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2013-12-05 13:51:54,777 [main-EventThread] INFO
org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
2013-12-05 13:51:56,804 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:356 - Attempting to
PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green -
recoveringAfterStartup=false
2013-12-05 13:51:56,806 [RecoveryThread] WARN
org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=fr_green url=
http://solr-08/searchsolrnodefr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates
2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery
was not successful - trying replication. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting
Replication Recovery. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:399 - Begin buffering
updates. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:replicate:127 - Attempting to
replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
2013-12-05 13:51:56,806 [RecoveryThread] INFO
org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 -
Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
2013-12-05 13:52:01,203 [RecoveryThread] INFO
org.apache.solr.handler.SnapPuller:<init>:211 - No value set for
'pollInterval'. Timer Task not started.
2013-12-05 13:52:01,209 [RecoveryThread] INFO
org.apache.solr.handler.SnapPuller:fetchLatestIndex:369 - Master's
generation: 4382
2013-12-05 13:52:01,209 [RecoveryThread] INFO
org.apache.solr.handler.SnapPuller:fetchLatestIndex:370 - Slave's
generation: 4382
2013-12-05 13:52:01,209 [RecoveryThread] INFO
org.apache.solr.handler.SnapPuller:fetchLatestIndex:371 - Starting
replication process
2013-12-05 13:52:01,213 [RecoveryThread] INFO
org.apache.solr.handler.SnapPuller:fetchLatestIndex:376 - Number of files
in latest index in master: 158
2013-12-05 13:52:01,213 [RecoveryThread] INFO
org.apache.solr.core.CachingDirectoryFactory:get:360 - return new
directory for
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
2013-12-05 13:52:01,213 [RecoveryThread] INFO
org.apache.solr.handler.SnapPuller:fetchLatestIndex:402 - Starting
download to
org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
lockFactory=org.apache.lucene.store.NativeFSLockFactory@697d36b7fullCopy=true
2013-12-05 13:52:21,970 [main-EventThread] INFO
org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
2013-12-05 13:55:15,039 [main-EventThread] INFO
org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
2013-12-05 13:56:27,160 [RecoveryThread] INFO
org.apache.solr.handler.SnapPuller:fetchLatestIndex:406 - Total time
taken for download : 265 secs
2013-12-05 13:56:27,540 [RecoveryThread] INFO
org.apache.solr.handler.SnapPuller:modifyIndexProps:886 - New index
installed. Updating index properties... index=index.20131205135201213
2013-12-05 13:56:27,546 [RecoveryThread] INFO
org.apache.solr.handler.SnapPuller:fetchLatestIndex:461 - removing old
index directory
org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
lockFactory=org.apache.lucene.store.NativeFSLockFactory@53949838
2013-12-05 13:56:27,765 [RecoveryThread] INFO
org.apache.solr.search.SolrIndexSearcher:<init>:197 - Opening
Searcher@30a544f3 main
2013-12-05 13:58:52,337 [RecoveryThread] INFO
org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:235 -
looking to close
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724[CachedDir<<refCount=0;path=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724;done=true>>]
2013-12-05 13:58:52,337 [RecoveryThread] INFO
org.apache.solr.core.CachingDirectoryFactory:close:304 - Closing
directory:
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
2013-12-05 13:58:52,337 [RecoveryThread] INFO
org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:279 -
Removing directory before core close:
/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
2013-12-05 13:58:54,172 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:replay:506 - Replaying buffered
documents. core=fr_green
2013-12-05 13:58:54,172 [recoveryExecutor-7-thread-2] WARN
org.apache.solr.update.UpdateLog:doReplay:1240 - Starting log replay
tlog{file=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/tlog/tlog.0000000000000004091
refcount=2} active=true starting pos=1103833347
2013-12-05 13:59:16,660 [recoveryExecutor-7-thread-2] INFO
org.apache.solr.search.SolrIndexSearcher:<init>:197 - Opening
Searcher@5e48fb12 main
2013-12-05 14:01:11,908 [main-EventThread] INFO
org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state
change: WatchedEvent state:SyncConnected type:NodeDataChanged
path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
2013-12-05 14:01:22,846 [recoveryExecutor-7-thread-2] WARN
org.apache.solr.update.UpdateLog:run:1230 - Log replay finished.
recoveryInfo=RecoveryInfo{adds=39345 deletes=0 deleteByQuery=0 errors=0
positionOfStart=1103833347}
2013-12-05 14:01:22,846 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:410 - Replication
Recovery was successful - registering as Active. core=fr_green
2013-12-05 14:01:22,846 [RecoveryThread] INFO
org.apache.solr.cloud.ZkController:publish:1017 - publishing
core=fr_green state=active
2013-12-05 14:01:22,850 [RecoveryThread] INFO
org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on
descriptor - reading it from system property
2013-12-05 14:01:22,860 [RecoveryThread] INFO
org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery
process. core=fr_green
Best regards,
Anca Kopetz
--
<generic-logo-no-strap.gif>
Anca Kopetz Software engineer
E anca.kop...@kelkoo.com Y!Messenger kelkooancak
T +33 (0)4 56 09 07 55
A 4/6 Rue des Meridiens 38130 Echirolles
<x-bottom-left.gif> <x-bottom-middle.gif>
Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris
Ce message et les pièces jointes sont confidentiels et établis à
l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
destinataire de ce message, merci de le détruire et d'en avertir
l'expéditeur.