Sounds like you need to raise your ZooKeeper connection timeout. Also, make sure you are using a concurrent garbage collector as a side note - stop the world pauses should be avoided. Just good advice :)
- Mark On Dec 18, 2013, at 5:48 AM, Anca Kopetz <anca.kop...@kelkoo.com> wrote: > Hi, > > In our SolrCloud cluster (2 shards, 8 replicas), the replicas go from time to > time into recovering state, and it takes more than 10 minutes to finish to > recover. > > In logs, we see that "PeerSync Recovery" fails with the message : > > PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many updates > received since start - startingUpdates no longer overlaps with our > currentUpdates > > Then "Replication Recovery" starts. > > Is there something we can do to avoid the failure of "Peer Recovery" so that > the recovery process is more rapid (less than 10 minutes) ? > > The full trace log is here : > > 2013-12-05 13:51:53,740 [http-8080-46] INFO > org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 > - It has been requested that we recover > 2013-12-05 13:51:53,740 [http-8080-112] INFO > org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705 > - It has been requested that we recover > 2013-12-05 13:51:53,740 [http-8080-112] INFO > org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] > webapp=null path=/admin/cores > params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 > QTime=0 > 2013-12-05 13:51:53,740 [Thread-1544] INFO > org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green > state=recovering > 2013-12-05 13:51:53,741 [http-8080-46] INFO > org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658 - [admin] > webapp=null path=/admin/cores > params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0 > QTime=1 > 2013-12-05 13:51:53,740 [Thread-1543] INFO > org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green > state=recovering > 2013-12-05 13:51:53,743 [Thread-1544] INFO > org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on > descriptor - reading it from system property > 2013-12-05 13:51:53,746 [Thread-1543] INFO > org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on > descriptor - reading it from system property > 2013-12-05 13:51:53,755 [Thread-1543] WARN > org.apache.solr.cloud.RecoveryStrategy:close:105 - Stopping recovery for > zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green > 2013-12-05 13:51:53,756 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. > core=fr_green recoveringAfterStartup=false > 2013-12-05 13:51:53,762 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery > process. core=fr_green > 2013-12-05 13:51:53,762 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:run:216 - Starting recovery process. > core=fr_green recoveringAfterStartup=false > 2013-12-05 13:51:53,765 [RecoveryThread] INFO > org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green > state=recovering > 2013-12-05 13:51:53,765 [RecoveryThread] INFO > org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on > descriptor - reading it from system property > 2013-12-05 13:51:53,767 [RecoveryThread] INFO > org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating > new http client, > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false > 2013-12-05 13:51:54,777 [main-EventThread] INFO > org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state > change: WatchedEvent state:SyncConnected type:NodeDataChanged > path:/clusterstate.json, has occurred - updating... (live nodes size: 18) > 2013-12-05 13:51:56,804 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:doRecovery:356 - Attempting to > PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green - > recoveringAfterStartup=false > 2013-12-05 13:51:56,806 [RecoveryThread] WARN > org.apache.solr.update.PeerSync:sync:232 - PeerSync: core=fr_green > url=http://solr-08/searchsolrnodefr too many updates received since start - > startingUpdates no longer overlaps with our currentUpdates > 2013-12-05 13:51:56,806 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:doRecovery:394 - PeerSync Recovery > was not successful - trying replication. core=fr_green > 2013-12-05 13:51:56,806 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:doRecovery:397 - Starting Replication > Recovery. core=fr_green > 2013-12-05 13:51:56,806 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:doRecovery:399 - Begin buffering > updates. core=fr_green > 2013-12-05 13:51:56,806 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:replicate:127 - Attempting to > replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green > 2013-12-05 13:51:56,806 [RecoveryThread] INFO > org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103 - Creating > new http client, > config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false > 2013-12-05 13:52:01,203 [RecoveryThread] INFO > org.apache.solr.handler.SnapPuller:<init>:211 - No value set for > 'pollInterval'. Timer Task not started. > 2013-12-05 13:52:01,209 [RecoveryThread] INFO > org.apache.solr.handler.SnapPuller:fetchLatestIndex:369 - Master's > generation: 4382 > 2013-12-05 13:52:01,209 [RecoveryThread] INFO > org.apache.solr.handler.SnapPuller:fetchLatestIndex:370 - Slave's > generation: 4382 > 2013-12-05 13:52:01,209 [RecoveryThread] INFO > org.apache.solr.handler.SnapPuller:fetchLatestIndex:371 - Starting > replication process > 2013-12-05 13:52:01,213 [RecoveryThread] INFO > org.apache.solr.handler.SnapPuller:fetchLatestIndex:376 - Number of files in > latest index in master: 158 > 2013-12-05 13:52:01,213 [RecoveryThread] INFO > org.apache.solr.core.CachingDirectoryFactory:get:360 - return new directory > for > /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213 > 2013-12-05 13:52:01,213 [RecoveryThread] INFO > org.apache.solr.handler.SnapPuller:fetchLatestIndex:402 - Starting download > to > org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213 > lockFactory=org.apache.lucene.store.NativeFSLockFactory@697d36b7 > fullCopy=true > 2013-12-05 13:52:21,970 [main-EventThread] INFO > org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state > change: WatchedEvent state:SyncConnected type:NodeDataChanged > path:/clusterstate.json, has occurred - updating... (live nodes size: 18) > 2013-12-05 13:55:15,039 [main-EventThread] INFO > org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state > change: WatchedEvent state:SyncConnected type:NodeDataChanged > path:/clusterstate.json, has occurred - updating... (live nodes size: 18) > 2013-12-05 13:56:27,160 [RecoveryThread] INFO > org.apache.solr.handler.SnapPuller:fetchLatestIndex:406 - Total time taken > for download : 265 secs > 2013-12-05 13:56:27,540 [RecoveryThread] INFO > org.apache.solr.handler.SnapPuller:modifyIndexProps:886 - New index > installed. Updating index properties... index=index.20131205135201213 > 2013-12-05 13:56:27,546 [RecoveryThread] INFO > org.apache.solr.handler.SnapPuller:fetchLatestIndex:461 - removing old index > directory > org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724 > lockFactory=org.apache.lucene.store.NativeFSLockFactory@53949838 > 2013-12-05 13:56:27,765 [RecoveryThread] INFO > org.apache.solr.search.SolrIndexSearcher:<init>:197 - Opening > Searcher@30a544f3 main > 2013-12-05 13:58:52,337 [RecoveryThread] INFO > org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:235 - looking > to close > /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724[CachedDir<<refCount=0;path=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724;done=true>>] > 2013-12-05 13:58:52,337 [RecoveryThread] INFO > org.apache.solr.core.CachingDirectoryFactory:close:304 - Closing directory: > /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724 > 2013-12-05 13:58:52,337 [RecoveryThread] INFO > org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:279 - Removing > directory before core close: > /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724 > 2013-12-05 13:58:54,172 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:replay:506 - Replaying buffered > documents. core=fr_green > 2013-12-05 13:58:54,172 [recoveryExecutor-7-thread-2] WARN > org.apache.solr.update.UpdateLog:doReplay:1240 - Starting log replay > tlog{file=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/tlog/tlog.0000000000000004091 > refcount=2} active=true starting pos=1103833347 > 2013-12-05 13:59:16,660 [recoveryExecutor-7-thread-2] INFO > org.apache.solr.search.SolrIndexSearcher:<init>:197 - Opening > Searcher@5e48fb12 main > 2013-12-05 14:01:11,908 [main-EventThread] INFO > org.apache.solr.common.cloud.ZkStateReader:process:210 - A cluster state > change: WatchedEvent state:SyncConnected type:NodeDataChanged > path:/clusterstate.json, has occurred - updating... (live nodes size: 18) > 2013-12-05 14:01:22,846 [recoveryExecutor-7-thread-2] WARN > org.apache.solr.update.UpdateLog:run:1230 - Log replay finished. > recoveryInfo=RecoveryInfo{adds=39345 deletes=0 deleteByQuery=0 errors=0 > positionOfStart=1103833347} > 2013-12-05 14:01:22,846 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:doRecovery:410 - Replication Recovery > was successful - registering as Active. core=fr_green > 2013-12-05 14:01:22,846 [RecoveryThread] INFO > org.apache.solr.cloud.ZkController:publish:1017 - publishing core=fr_green > state=active > 2013-12-05 14:01:22,850 [RecoveryThread] INFO > org.apache.solr.cloud.ZkController:publish:1021 - numShards not found on > descriptor - reading it from system property > 2013-12-05 14:01:22,860 [RecoveryThread] INFO > org.apache.solr.cloud.RecoveryStrategy:doRecovery:495 - Finished recovery > process. core=fr_green > > Best regards, > Anca Kopetz > -- > <generic-logo-no-strap.gif> > Anca Kopetz Software engineer > > E anca.kop...@kelkoo.com Y!Messenger kelkooancak > T +33 (0)4 56 09 07 55 > A 4/6 Rue des Meridiens 38130 Echirolles > <x-bottom-left.gif> <x-bottom-middle.gif> > Kelkoo SAS > Société par Actions Simplifiée > Au capital de € 4.168.964,30 > Siège social : 8, rue du Sentier 75002 Paris > 425 093 069 RCS Paris > > Ce message et les pièces jointes sont confidentiels et établis à l'attention > exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce > message, merci de le détruire et d'en avertir l'expéditeur.